In year 1, [the Good Judgment Project] beat the official control group by 60%. In year 2, we beat the control group by 78%. GJP also beat its university-affiliated competitors, including the University of Michigan and MIT, by hefty margins, from 30% to 70%, and even outperformed professional intelligence analysts with access to classified data. After two years, GJP was doing so much better than its academic competitors that IARPA dropped the other teams.
I keep wondering what these other teams were doing. Good Judgment Project sounds like it was doing the simplest, most obvious possible tactic – asking people to predict things and seeing what happened. David Manheim says the other groups tried “more straightforward wisdom of crowds” methods, so maybe GJP’s secret sauce was concentrating on the best people instead of on everyone? Still seems like it should have taken fewer than five universities and a branch of government to think of that.
One result that particularly surprised me was the effect of a tutorial covering some basic concepts that we’ll explore in this book and are summarized in the Ten Commandments appendix. It took only about sixty minutes to read and improved accuracy by roughly 10% through the entire tournament year. Yes, 10% may sound modest, but it was achieved at so little cost.
These Ten Commandments are available online here.
For centuries, [aversion to measuring things and collecting evidence] hobbled progress in medicine. When physicians finally accepted that their experience and perceptions were not reliable means of determining whether a treatment works, they turned to scientific testing – and medicine finally started to make rapid advances.
I see what Tetlock is trying to say here, but as written it’s horribly wrong.
Evidence-based medicine could be fairly described as starting in the 1970s with Cochrane’s first book, and really took off in the 80s and 90s. But this is also the period when rapid medical advances started slowing down! In my own field of psychiatry, the greatest advances were the first antidepressants and antipsychotics in the 50s, the benzodiazepines in the 60s, and then a gradual trickle of slightly upgraded versions of these through the 70s and 80s. The last new drugs that could be called “revolutionary” by any stretch of the imagination were probably the first SSRIs in the early 80s. This is the conventional wisdom of the field and everybody admits this, but I would add the stronger claim that the older medications in many ways work better. I know less about the history of other subfields, but they seem broadly similar – the really amazing discoveries are all pre-EBM, and the new drugs are mostly nicer streamlined versions of the old ones.
There’s an obvious “low-hanging fruit” argument to be made here, but some people (I think Michael Vassar sometimes toys with this idea) go further and say that evidence-based medicine as currently practiced can actually retard progress. In the old days, people tried possible new medications in a very free-form and fluid way that let everyone test their pet ideas quickly and keep the ones that worked; nowadays any potential innovations need $100 million 10-year multi-center trials which will only get funded in certain very specific situations. And in the old days, a drug would only be kept if it showed obvious undeniable improvement in patients, whereas nowadays if a trial shows a p < 0.05, d = 0.10 advantage, that's enough to make it the new standard if it's got a good pharma company behind it. So the old method allowed massive-scale innovation combined with high standards for success; the new method only allows very limited innovation but keeps everything that can show the slightest positive effect whatsoever on an easily-rigged but very expensive test. I'm not sure I believe in the strong version of this argument (the low-hanging fruit angle is probably sufficient), but the idea that medicine only started advancing after the discovery of evidence-based medicine is just wrong. A better way of phrasing it might be that around that time we started getting fewer innovations, but we also became a lot more effective and intelligent at using the innovations we already had.
Consider Galen, the second-century physician to Rome’s emperors…Galen was untroubled by doubt. Each outcome confirmed he was right, no matter how equivocal the evidence might look to someone less wise than the master. “All who drink of this treatment recover in a short time, except those whom it does not help, who all die,” he wrote. “It is obvious, therefore, that it fails only in incurable cases.”
After hearing one too many “everyone thought Columbus would fall off the edge of the flat world” -style stories, I tend to be skeptical of “people in the past were hilariously stupid” anecdotes. I don’t know anything about Galen, but I wonder if this was really the whole story.
When hospitals created cardiac care units to treat patients recovering from heart attacks, Cochrane proposed a randomized trial to determine whether the new units delivered better results than the old treatment, which was to send the patient home for monitoring and bed rest. Physicians balked. It was obvious the cardiac care units were superior, they said, and denying patients the best care would be unethical. But Cochrane was not a man to back down…he got his trial: some patients, randomly selected, were sent to the cardiac care units while others were sent home for monitoring and bed rest. Partway through the trial, Cochrane met with a group of the cardiologists who had tried to stop his experiment. He told them that he had preliminary results. The difference in outcomes between the two treatments was not statistically significant, he emphasized, but it appeared that patients might do slightly better in the cardiac care units. “They were vociferous in their abuse: ‘Archie,’ they said, ‘we always thought you were unethical. You must stop the trial at once.’” But then Cochrane revealed he had played a little trick. He had reversed the results: home care had done slightly better than the cardiac units. “There was dead silence and I felt rather sick because they were, after all, my medical colleagues.”
This story is the key to everything. See also my political spectrum quiz and the graph that inspired it. Almost nobody has consistent meta-level principles. Almost nobody really has opinions like “this study’s methodology is good enough to believe” or “if one group has a survival advantage of size X, that necessitates stopping the study as unethical”. The cardiologists sculpted their meta-level principles around what best supported their object-level opinions – that more cardiology is better – and so generated the meta-level principles “Cochrane’s experiment is accurate” and “if one group has a slight survival advantage, that’s all we need to know before ordering the experiment stopped as unethical.” If Cochrane had (truthfully) told them that the cardiology group was doing worse, they would have generated the meta-level principles “Cochrane’s experiment is flawed” and “if one group has a slight survival advantage that means nothing and it’s just a coincidence”. In some sense this is correct from a Bayesian point of view – I interpret sonar scans of Loch Ness that find no monsters to be probably accurate, but if a sonar scan did find a monster I’d wonder if it was a hoax – but in less obvious situations it can be a disaster. Cochrane understood this and so fed them the wrong data and let them sell him the rope he needed to hang them. I know no better solution to this except (possibly) adversarial collaboration. Also, I suppose this is more proof (as if we needed it) that cardiologists are evil.
In the late 1940s, the Communist government of Yugoslavia broke from the Soviet Union, raising fears that the Soviets would invade. In March 1951 [US intelligence under Sherman Kent reported there was a “serious possibility” of a Soviet attack.] But a few days later, Kent was chatting with a senior State Department official who casually asked, “By the way, what did you people mean by the expression ‘serious possibility’? What kind of odds did you have in mind?” Kent said he was pessimistic. He felt that the odds were about 65 to 35 in favor of an attack. The official was startled. He and his colleagues had taken “serious possibility” to mean much lower odds.
Disturbed, Kent went back to his team. They had all agreed to use “serious possibility” in the [report], so Kent asked each person, in turn, what he thought it meant. One analyst said it meant odds of about 80%. Another thought it meant odds of 20% – exactly the opposite. Other answers were scattered between those extremes. Kent was floored. A phrase that looked informative was so vague as to be almost useless…
In 1961, when the CIA was planning to topple the Castro government by landing a small army of Cuban expatriates at the Bay of Pigs, President John F. Kennedy turned to the military for an unbiased assessment. The Joint Chiefs of Staff concluded that the plan had a “fair chance” of success. The man who wrote the words “fair chance” later said he had in mind odds of 3 to 1 against. But Kennedy was never told precisely what “fair chance” meant and, not unreasonably, he took it to be a much more positive assessment.
…
Nate Silver, Princeton’s Sam Wang, and other poll aggregators were hailed for correctly predicting all fifty state outcomes, but almost no one noted that a crude, across-the-board prediction of “no change” – if a state went Democratic or Republican in 2008, it will do the same in 2012 – would have scored forty-eight out of fifty, which suggests that the many excited exclamations of “he called all fifty states!” we heard at the time were a tad overwrought.
I didn’t realize this. I think this election I’m going to predict the state-by-state results just so that I can tell people I “predicted 48 of the 50 states” or something and sound really impressive.
The [Expert Political Judgment] data revealed an inverse correlation between fame and accuracy: the more famous an expert was, the less accurate he was. That’s not because editors, producers, and the public go looking for bad forecasters. They go looking for hedgehogs, who just happen to be bad forecasters. Animated by a Big Idea, hedgehogs tell tight, simple, clear stories that grab and hold audiences.
One day aliens are going to discover humanity and be absolutely shocked we made it past the wooden-club stage.
In 2008, the Office of the Director of National Intelligence – which sits atop the entire network of sixteen intelligence agencies – asked the National Research Council to form a committee. The task was to synthesize research on good judgment and help the IC put that research to good use. By Washington’s standards, it was a bold (or rash) thing to do. It’s not every day that a bureaucracy pays one of the world’s most respected scientific institutions to produce an objective report that might conclude that the bureaucracy was clueless.
This was a big theme of the book: the US intelligence community deserves celebration for daring to investigate its own competency at all. Interestingly, a lot of its investigations said it was doing things more right than we would think: Tetlock mentions that even independent-to-hostile investigators concluded that it had been correct in using the facts it had to believe Saddam had WMDs. The book didn’t explain exactly how this worked: possibly Saddam was trying to deceive everyone into thinking he had WMDs to prevent attacks, and did a good job? This was part of what got the intelligence community interested in probability: given that they had made a reasonable decision in saying there were WMDs, but it had been a big disaster for the United States, what could they have done differently? Their answer was “continue to make the reasonable decision, but learn to calibrate themselves well enough to admit there’s a big chance they’re wrong.”
[We finished by giving] the forecast a final tweak: “extremizing” it, meaning pushing it closer to 100% or zero. If the forecast is 70% you might bump it up to, say, 85%. If it’s 30%, you might reduce it to 15%…[it] is based on a pretty simple insight: when you combine the judgments of a large group of people to calculate the “wisdom of the crowd” you collect all of the relevant information that is dispersed among all those people. But none of those people has access to all that information…what would happen if every one of those people were given all the information? They would become more confident. If you then calculated the wisdom of the crowd, it too would be more extreme.
Something to remember if you’re doing wisdom-of-crowds with calibration estimates.
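The book doesn’t spell out GJP’s exact extremizing formula, but a common version is to raise the forecast’s odds to a power a > 1 and convert back to a probability. A minimal sketch (the exponent a = 2 is chosen only because it roughly reproduces the 70%→85% and 30%→15% numbers in the quote):

```python
def extremize(p, a=2.0):
    """Push an aggregate probability toward 0 or 1 by raising its odds to the power a.

    a = 1 leaves the forecast unchanged; larger a extremizes more. This is a
    common form of the transform, not necessarily GJP's exact algorithm.
    """
    return p ** a / (p ** a + (1 - p) ** a)

for p in (0.70, 0.30):
    print(p, "->", round(extremize(p), 2))   # 0.70 -> 0.84, 0.30 -> 0.16
```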
The correlation between how well individuals do from one year to the next is about 0.65…Regular forecasters scored higher on intelligence and knowledge tests than about 70% of the population. Superforecasters did better, placing higher than about 80% of the population.
People interested in taking these kinds of tests are generally intelligent; superforecasters are somewhat more, but not vastly more, intelligent than that.
Researchers have found that merely asking people to assume their initial judgment is wrong, to seriously consider why that might be, and then make another judgment, produces a second estimate which, when combined with the first, improves accuracy almost as much as getting a second estimate from another person.
There’s a rationalist tradition – I think it started with Mike and Alicorn – that before you get married, you ask all your friends to imagine that the marriage failed and tell you why. I guess if you just asked people “Will our marriage fail?” everyone would say no, either out of optimism or social desirability bias. If you ask “Assume our marriage failed and tell us why”, you’ll actually hear people’s concerns. I think this is the same principle. On the other hand, I’ve never heard of anyone trying this and deciding not to get married after all, so maybe we’re just going through the motions.
[Superforecaster] Doug Lorch knows that when people read for pleasure they naturally gravitate to the like-minded. So he created a database containing hundreds of information sources – from the New York Times to obscure blogs – that are tagged by their ideological orientation, subject matter, and geographical origin, then wrote a program that selects what he should read next using criteria that maximize diversity.
Of all humans, only Doug Lorch is virtuous. Well, Doug Lorch and this guy from rationalist Tumblr who tried to get the program but was told it wasn’t really the sort of thing you could just copy and give someone.
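We’re only told that Lorch’s program “maximizes diversity” across tagged sources, so the following is purely an illustrative guess at what such a selector might look like, not his actual code; the sources and tags are made up:

```python
from collections import Counter

# Hypothetical sketch: each source carries ideology / subject / region tags,
# and we greedily pick the source whose tags have appeared least often in
# what was read recently.
sources = [
    {"name": "New York Times", "tags": {"left", "us", "politics"}},
    {"name": "obscure blog",   "tags": {"right", "asia", "economics"}},
    # ...hundreds more
]

def pick_next(sources, recently_read):
    tag_counts = Counter(tag for s in recently_read for tag in s["tags"])
    return min(sources, key=lambda s: sum(tag_counts[t] for t in s["tags"]))

reading_list = []
for _ in range(3):
    reading_list.append(pick_next(sources, reading_list))
print([s["name"] for s in reading_list])
```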
[The CIA was advising Obama about whether Osama bin Laden was in Abbottabad, Pakistan; their estimates averaged around 70%]. “Okay, this is a probability thing,” the President said in response, according to Bowden’s account. Bowden editorializes: “Ever since the agency’s erroneous call a decade earlier [on Saddam’s weapons of mass destruction], the CIA had instituted an almost comically elaborate process for weighing certainty…it was like trying to contrive a mathematical formula for good judgment.” Bowden was clearly not impressed with the CIA’s use of numbers and probabilities. Neither was Barack Obama, according to Bowden. “What you ended up with, as the president was finding, and as he would later explain to me, was not more certainty but more confusion…in this situation, what you started to get was probabilities that disguised uncertainty, as opposed to actually providing you with useful information…”
After listening to the widely ranging opinions, Obama addressed the room. “This is fifty-fifty,” he said. That silenced everyone. “Look guys, this is a flip of the coin. I can’t base this decision on the notion that we have any greater certainty than that…
The information Bowden provides is sketchy but it appears that the median estimate of the CIA officers – the “wisdom of the crowd” – was around 70%. And yet Obama declares the reality to be “fifty-fifty.” What does he mean by that?…Bowden’s account reminded me of an offhanded remark that Amos Tversky made some thirty years ago…In dealing with probabilities, he said, most people only have three settings: “gonna happen,” “not gonna happen,” and “maybe”.
Lest I make it look like Tetlock is being too unfair to Obama, he goes on to say that maybe he was speaking colloquially. But the way we speak colloquially says a lot about us, and there are many other examples of people saying this sort of thing and meaning it. This ties back into an old argument we had here on whether something like a Bayesian concept of probability was meaningful/useful. Some people said that it wasn’t, because everyone basically understands probability and Bayes doesn’t add much to that. I said it was, because people’s intuitive idea of probability is hopelessly confused and people don’t really think in probabilistic terms. I think we have no idea how confused most people’s idea of probability is, and perhaps even Obama, one of our more intellectual presidents, has some issues there.
Barbara Mellers has shown that granularity predicts accuracy: the average forecaster who sticks with the tens – 20%, 30%, 40% – is less accurate than the finer-grained forecaster who uses fives – 20%, 25%, 30% – and still less accurate than the even finer-grained forecaster who uses ones – 20%, 21%, 22%. As a further test, she rounded forecasts to make them less granular, so a forecast at the greatest granularity possible in the tournament, single percentage points, would be rounded to the nearest five, and then the nearest ten. This way, all of the forecasts were made one level less granular. She then recalculated Brier scores and discovered that superforecasters lost accuracy in response to even the smallest-scale rounding, to the nearest 0.05, whereas regular forecasters lost little even from rounding four times as large, to the nearest 0.2.
This was the part nobody on the comments to the last post believed, and I have trouble believing it too.
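For anyone who wants to see the mechanics of Mellers’ check, here is a minimal sketch: score forecasts with the Brier rule, round them to a coarser grid, and score again. The data is made up, and this uses the simple one-outcome form of the Brier score rather than the 0-to-2 version the tournament reported:

```python
import random

def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes; lower is better."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def coarsen(p, step):
    """Round a probability to the nearest multiple of step (e.g. 0.05 or 0.1)."""
    return round(p / step) * step

random.seed(0)
forecasts = [round(random.random(), 2) for _ in range(10000)]     # 1% granularity
outcomes = [1 if random.random() < p else 0 for p in forecasts]   # toy outcomes

for step in (None, 0.05, 0.10):
    rounded = forecasts if step is None else [coarsen(p, step) for p in forecasts]
    print(step, round(brier(rounded, outcomes), 4))
```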
[There’s a famous Keynes quote: “When the facts change, I change my mind. What do you do, sir?”] It’s cited in countless books, including one written by me and another by my coauthor. Google it and you will find it’s all over the internet. Of all the many famous things Keynes says, it’s probably the most famous. But while researching this book, I tried to track it to its source and failed. Instead I found a post by a Wall Street Journal blogger, which said that no one has ever discovered its provenance and the two leading experts on Keynes think it is apocryphal. In light of these facts, and in the spirit of what Keynes apparently never said, I concluded that I was wrong.
The funny part is that if this fact is true, we’ve known it for fifty years, and people still haven’t changed their mind about whether he said it or not.
“Keynes is always ready to contradict not only his colleagues but also himself whenever circumstances make this seem appropriate,” reported a 1945 profile of the “consistently inconsistent” economist. “So far from feeling guilty about such reversals of position, he utilizes them as pretexts for rebukes to those he saw as less nimble-minded. Legend says that while conferring with Roosevelt at Quebec, Churchill sent Keynes a cable reading, ‘Am coming around to your point of view.’ His Lordship replied, ‘Sorry to hear it. Have started to change my mind.’”
I sympathize with this every time people email me to say how much they like the Non-Libertarian FAQ.
Police officers spend a lot of time figuring out who is telling the truth and who is lying, but research has found they aren’t nearly as good at it as they think they are and they tend not to get better with experience…predictably, psychologists who test police officers’ ability to spot lies in a controlled setting find a big gap between their confidence and their skill. And that gap grows as officers become more experienced and they assume, not unreasonably, that their experience has made them better lie detectors.
There’s some similar research on doctors and certain types of diagnostic tasks that don’t give quick feedback.
In 1988, when the Soviet Union was implementing major reforms that had people wondering about its future, I asked experts to estimate how likely it was that the Communist Party would lose its monopoly on power in the Soviet Union in the next five years. In 1991 the world watched in shock as the Soviet Union disintegrated. So in 1992-93 I returned to the experts, reminded them of the question in 1988, and asked them to recall their estimates. On average, the experts recalled a number 31 percentage points higher than the correct figure. So an expert who thought there was only a 10% chance might remember herself thinking there was a 40% or 50% chance. There was even a case in which an expert who pegged the probability at 20% recalled it as 70%.
As the old saying goes, hindsight is 20/70.
The results were clear-cut each year. Teams of ordinary forecasters beat the wisdom of the crowd by about 10%. Prediction markets beat ordinary teams by about 20%. And superteams beat prediction markets by 15% to 30%. I can already hear the protests from my colleagues in finance that the only reason the superteams beat the prediction markets was that our markets lacked liquidity…they may be right. It is a testable idea, and one worth testing.
The correct way to phrase this is “if there is ever a large and liquid prediction market, Philip Tetlock will gather his superforecasters, beat the market, become a zillionaire, and then the market will be equal to or better than the forecasters.”
Orders in the Wehrmacht were often short and simple – even when history hung in the balance. “Gentlemen, I demand that your divisions completely cross the German borders, completely cross the Belgian borders, and completely cross the River Meuse,” a senior officer told the commanders who would launch the great assault into Belgium and France on May 10, 1940. “I don’t care how you do it, that’s completely up to you.”
This is the opposite of the image most people have of Germany’s World War II military. The Wehrmacht served a Nazi regime that preached total obedience to the dictates of the Fuhrer, and everyone remembers the old newsreels of German soldiers marching in goose-stepping unison…but what is often forgotten is that the Nazis did not create the Wehrmacht. They inherited it. And it could not have been more different from the unthinking machine we imagine.
[…]
Shortly after WWI, Eisenhower, then a junior officer who had some experience with the new weapons called tanks, published an article in the US Army’s Infantry Journal making the modest argument that “the clumsy, awkward and snail-like progress of the old tanks must be forgotten, and in their place we must picture this speedy, reliable, and efficient engine of destruction.” Eisenhower was dressed down. “I was told my ideas were not only wrong but dangerous, and that henceforth I was to keep them to myself,” he recalled. “Particularly, I was not to publish anything incompatible with solid infantry doctrine. If I did, I would be hauled before a court martial.”
Tetlock includes a section on what makes good teams and organizations. He concludes that they’re effective when low-level members are given leeway both to pursue their own tasks as best they see fit, and to question and challenge their higher-ups. He contrasts the Wehrmacht, which was very good at this and overperformed its fundamentals in WWII, to the US Army, which was originally very bad at this and underperformed its fundamentals until it figured this out. Later in the chapter, he admits that his choice of examples might raise some eyebrows, but says that he did it on purpose to teach us to think critically and overcome cognitive dissonance between our moral preconceptions and our factual beliefs. I hope he has tenure.
Ultimately the Wehrmacht failed. In part, it was overwhelmed by its enemies’ superior resources. But it also made blunders – often because its commander-in-chief, Adolf Hitler, took direct control of operations in violation of Helmuth von Moltke’s principles, nowhere with more disastrous effect than during the invasion of Normandy. The Allies feared that after their troops landed, German tanks would drive them back to the beaches and into the sea, but Hitler had directed that the reserves could only move on his personal command. Hitler slept late. For hours after the Allies landed on the beaches, the dictator’s aides refused to wake him to ask if he wanted to order the tanks into battle.
Early to bed
And early to stir up
Makes a man healthy
And ruler of Europe
The humility required for good judgment is not self-doubt – the sense that you are untalented, unintelligent, or unworthy. It is intellectual humility. It is a recognition that reality is profoundly complex, that seeing things clearly is a constant struggle, when it can be done at all, and that human judgment must therefore be riddled with mistakes. This is true for fools and geniuses alike. So it’s quite possible to think highly of yourself and be intellectually humble. In fact, this combination can be wonderfully fruitful. Intellectual humility compels the careful reflection necessary for good judgment; confidence in one’s abilities inspires determined action.
Yes! This is a really good explanation of Eliezer Yudkowsky’s Say It Loud.
(and that sentence would also have worked without the apostrophe or anything after it).
I am…optimistic that smart, dedicated people can inoculate themselves to some degree against certain cognitive illusions. That may sound like a tempest in an academic teapot, but it has real-world implications. If I am right, organizations will have more to gain from recruiting and training talented people to resist their biases.
This is probably a good time to mention that CFAR is hiring.
To answer your first question, it’s the extremizing algorithm that made them more successful than other groups, IIRC. If two different people predict that the probability of an event occurring is 70%, the actual probability of the event occurring is greater than 70%, because of consilience. The algorithm used by the Good Judgment Project exploits this, while the other groups’ methods do not.
Where do you recall this from?
I think there are stats in the paper by Tetlock and Pavel Anatasov (spelling?) comparing GJP to prediction markets. Should be findable on Google Scholar.
IDK where 27chaos knows it from, but I came down here to post the same thing, recalling it from an episode of EconTalk (http://www.econtalk.org/archives/2015/12/philip_tetlock.html)
I think you are misreading that. I think that says that extremizing was the best method that they came up with, but that all of their methods (including their own prediction market) trounced the other teams.
Edge has a series of discussion videos on their website, with Tetlock and a bunch of random people from various industries.
http://edge.org/conversation/philip_tetlock-edge-master-class-2015-a-short-course-in-superforecasting-class-i
Transcript below, it’s discussed early on.
Yes, they talk about extremizing there, but they don’t say it was their advantage over other teams. In Part II they say that the other teams lost because of “mismanagement.”
Scott mentioned that “David Manheim says the other groups tried “more straightforward wisdom of crowds” methods, so maybe GJP’s secret sauce was concentrating on the best people instead of on everyone?”
The Edge link says that “If you were running a forecasting tournament over an extended period of time and you had, say, 500-plus questions and thousands of forecasters, and you have estimates of diversity and accuracy over long periods of time, you can work out algorithms that do a better job of distilling the wisdom of the crowd than, say, simple averaging.”
I put the two concepts next to each other. I thought the flow was pretty natural inside my head, but I can understand why it wouldn’t make much sense as I presented it. The point I was trying to make is that Manheim was likely referring to extremizing.
Yes. The point I was making was that GJP didn’t just pick good people and aggregate them, picking their average; that is straightforward wisdom of crowds. Instead, they harnessed the data from predictions and built a model using that as inputs. I had assumed the book explained some of this, but I haven’t read it yet.
For example, in addition to “extremizing”, they took advantage of the fact that updates over time show trends; you can update the aggregate estimates more heavily based on those forecasters who submit updates.
(But the term consilience is unrelated, and estimates may be correlated; I don’t think extremizing as explained in the comment captures what they were doing.)
Interesting that they adjusted for trends as well.
Consilience seems like the right word to me, but maybe I’m misunderstanding how the extremizing algorithm works?
My understanding is that extremizing worked really well but *not* for superforecasters. For them, extremizing only gave negligible gains.
I don’t follow this argument. Suppose I have a weird-shaped coin, and I gather hundreds of scientists to analyze its shape and determine the probability that flipping it will give heads. Each of them independently gives an answer that’s close to 70%. I think this is pretty good evidence that the probability of the coin landing heads is indeed about 70%. Why would it be higher?
It could be higher if different experts used independent information sources to reach the 70% conclusion. Suppose one expert analyses the metallurgical qualities of the coin and determines there is a 70% of it landing heads, and another expert analyses the aerodynamic qualities of the coin and determines a 70% probability. You want your prediction to account for both the metallurgical and aerodynamic qualities of the coin, so your prediction would be higher than 70%.
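To put a toy number on this (assuming a 50% prior and that the metallurgical and aerodynamic analyses really are independent lines of evidence): each 70% estimate corresponds to odds of 7:3, and independent evidence multiplies odds, so the combined odds are 7/3 × 7/3 = 49/9, which is a probability of 49/58 ≈ 84%. That is the same intuition that motivates extremizing the aggregate rather than just averaging it.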
Each scientist flips the coin twice and reports a probability of 75%. You conclude that each scientist saw two heads and used Laplace’s law of succession, (2+1)/(2+2). You conclude that the coin lands heads >99.5% of the time.
I’m confused, shouldn’t your resulting probability for the coin landing heads next time be 5/6? Since you’ve essentially observed the coin coming up heads 4 times out of 4.
“Hundreds of scientists” was specified, not just two.
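Spelling out the arithmetic: if n scientists each flipped the coin twice, independently, and every flip came up heads, then pooling all the flips and applying Laplace’s rule gives (2n+1)/(2n+2). With “hundreds of scientists”, say n = 100, that is 201/202, just over 99.5%, even though each individual report was only 75%.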
https://en.wikipedia.org/wiki/Consilience
See note two in particular. Of course, since all the forecasters are somewhat similar to each other (they are all human, of a similar age, have an interest in forecasting, etc) their errors were probably not entirely uncorrelated with each other. But apparently there’s enough difference between people that extremizing resulted in improvements.
I suspect the nature of the task was to predict one off events like whether a country will go to war in a given year or if a depression will occur.
In the coin flip example we assume that, even though it may in principle be determined, in practice there is an upper bound on the amount of information we can get about its result (its bias) and that actual experts may be able to exhaust that information.
Attempts to predict world events are just the opposite. There doesn’t seem to be any practical upper bound on how certain we can be about an outcome (surely there are such upper bounds but any serious expert won’t be generating probabilities even remotely bumping up against them) and the extent of the information about these events is so large that no expert could fully exhaust the predictive information in the publicly available data.
This creates a situation where each expert is focusing in on some predictive information and ignoring others. Thus, if two experts each reach a 70% probability estimate while each ignored information the other relied on the (assuming, plausibly, that factors predicting an event usually interact positively) it makes sense to estimate a greater than 70% chance of the event.
The difficulty is figuring out just how much more since it will likely vary by the type of problem being considered.
I participated in one of the other teams’ efforts — namely DAGGRE (a George Mason affiliated team, advised by Robin Hanson). It was a prediction market, with the innovative addition of exploiting Robin’s “combinatorial” prediction market concept, i.e. it allowed re-using assets when betting on conditional probabilities.
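For readers unfamiliar with the mechanism: Hanson’s automated market makers are built on the logarithmic market scoring rule (LMSR). A minimal sketch of the plain, non-combinatorial version is below; it is background on the general idea, not DAGGRE’s actual implementation, and the liquidity parameter b is arbitrary:

```python
import math

def cost(q, b=100.0):
    """LMSR cost function C(q) = b * log(sum_i exp(q_i / b)), where q is outstanding shares."""
    return b * math.log(sum(math.exp(qi / b) for qi in q))

def prices(q, b=100.0):
    """Implied probabilities for each outcome: softmax of q / b."""
    z = sum(math.exp(qi / b) for qi in q)
    return [math.exp(qi / b) / z for qi in q]

q = [0.0, 0.0]                 # two-outcome market, no trades yet
print(prices(q))               # [0.5, 0.5]

new_q = [q[0] + 50.0, q[1]]    # a trader buys 50 shares of outcome 0
print(cost(new_q) - cost(q))   # what the trader pays the market maker
print(prices(new_q))           # outcome 0's implied probability rises to ~0.62
```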
I think DAGGRE performed respectably, but it’s worth mentioning that IARPA set up the tournament in a way that was extremely biased against prediction markets. (Especially “payment for accuracy” — i.e. participation incentives were OK, but not accuracy incentives.) In fact, a “straight up” prediction market was essentially banned as a forecast aggregation method, and I think DAGGRE was only permitted because of the combinatorial market maker being an untested twist on the concept. I don’t know anything that happened behind the scenes, but my suspicion is that IARPA (presumably reflecting the opinion at CIA) went into the tournament knowing that they wanted the outcome to be “it is possible to beat a prediction market using some sort of experts”. This, in turn, is an attitude that makes sense in the context of Robin’s previous experience running a policy market (“terrorism futures!”) and the ensuing political backlash.
After reading some of Tetlock et al’s papers on the tournament, my conclusion is that they have fairly credible statistical evidence that the supers can beat a relatively crappy, illiquid, play-money PM. But the “extremization” algorithm was a key component of this, and it’s not entirely clear whether one can apply a similar algorithmic correction to a PM’s forecasts to improve them to a level that is equal to or superior to the supers. (The algorithm would have to be different in nature. Namely, while surveying people and averaging leads to underconfident forecasts, because Bayes, prediction markets suffer from different flaws, like the favorite longshot bias.) Actually, to caveat this caveat, Tetlock has also said that extremization is less important for teams of supers, essentially because they self-extremize (i.e. when they talk to each other while forecasting, they update based upon one another’s evidence if the lines of reasoning are semi-independent). So maybe if you use Tetlock’s state-of-the-art methods (a team of top supers, working in concert), the degree to which they beat a PM is more clear-cut.
Anyhow, you can ask the DAGGRE team (Robin, Charles Twardy, Kathryn Lasky) for more info about the institutional background. Perhaps now that their funding (and the funding for their successor project Scicast) seems to have been fully cut off, they will be able to share more about their experience with the project.
My guess is that a subsidized, combinatorial prediction market (using real money, so you get rewarded for accuracy) populated by a large enough pool of normal traders (not just supers) with access only to OSINT, will easily outperform CIA experts with access to classified information. But as a policy tool, this will never be adopted in any meaningful way, so it’s probably good for the USA that IARPA has found in Tetlock’s approach a methodology that gives comparable accuracy gains while being more politically palatable.
I was also an active member of the DAGGRE team, and I agree it was a shame that they weren’t allowed to pay for performance.
In many ways, players on the DAGGRE market did not act rationally, and shared relevant information in the comment sections. I could certainly see group forecasting outperforming a prediction market simply because of the information sharing.
I also recall that DAGGRE had some bug in their code for a while that caused them to submit stale/inaccurate forecasts to IARPA. When they discovered the problem, IARPA said no, you cannot change your past submissions. This also hurt the DAGGRE team a little.
Honestly, my own opinion is that the system used, whether it’s prediction markets or Delphi groups or whatever, is secondary to your population of forecasters. Tetlock’s team did great mainly because they recruited and motivated great forecasters.
But scicast did pay for accuracy, didn’t it? Did that help compared to when it didn’t? It didn’t look to me like it helped.
You can read the final report at http://blog.scicast.org/download/scicast-final-report-public/?wpdmdl=2573
I don’t think the study was designed in a manner that allows one to easily answer the question “did accuracy incentives help?” But Table 8 on page 53 shows that in some sense the answer is yes: accuracy incentives led to more edits and more information elicited, albeit less information per edit (suggesting many noise traders were incentivized to trade as well).
I don’t have strong evidence to back this up, but personally I believe yes, SciCast’s accuracy incentives did indeed help.
I participated in GJP rather than DAGGRE simply because GJP started recruiting well before DAGGRE. If I had known of both when signing up I would have chosen DAGGRE.
So part of GJP’s advantage appears to have been organizational competence.
That is unquestionably true, and something Tetlock emphasizes in public discussion of GJP’s performance relative to its competitors. It’s also the sort of take-away that the intelligence community sponsors must love: “The key to getting good results is effective management.”
I don’t entirely understand what IARPA’s motivations were. Being against PMs is sensible, but it doesn’t explain my own experience with them. For the first two GJP seasons or so, I had been crossposting my trades as predictions to Prediction Book, because why not? They were all legitimate predictions and it would be nice to have my own records. I stopped when I got an upset email from GJP, who had gotten an angry email from IARPA, asking who was posting all the contracts online and would they please stop it ASAP?
The fact that they were comparing different approaches could easily be undermined if there was a third party source of aggregated information. At the very least, it could cause bleeding of information between groups that should otherwise have been independent.
I wouldn’t say the tournament looked biased against markets or anything. Sure there were no rewards but survey forecasters also got none – and points are a much less delayed gratification than Brier scores since you can get them before the question closes. And the IARPA control group was a prediction market. And GJP ran several prediction markets internally. In year 4 there was not only the “regular” prediction market, but also a “supermarket” with around 130 participants mostly self-selected from the already quite active superforecaster crowd. (How many *active* traders did DAGGRE ever have by the way? Not just signed up but making daily trades?) In the end the supermarket did beat most of the survey teams but not the aggregate across teams.
It was definitely biased against markets, insofar as the best (and arguably the only valid) way to test PMs in this context would be to allow payment for accuracy. This was explicitly banned (see page 11 at http://www.iarpa.gov/images/files/programs/ace/ACE_Proposers_Day_Brief.pdf).
But the survey forecasters weren’t paid for accuracy, either. And despite the lack of payments, the markets were apparently liquid enough to achieve a rather formidable performance. I mean, sure prediction markets are nice in the sense that you can easily prove theorems about them, but that alone doesn’t mean everything else must necessarily not work 🙂
I assume IARPA wanted to avoid a repeat of this https://en.m.wikipedia.org/wiki/Policy_Analysis_Market
Basically, a prediction market for policy was proposed and several Senators lambasted it as offensive. No point in discovering how effective an option is if it is not an option.
Tetlock mentions that even independent-to-hostile investigators concluded that it had been correct in using the facts it had to believe Saddam had WMDs. The book didn’t explain exactly how this worked: possibly Saddam was trying to deceive everyone into thinking he had WMDs to prevent attacks, and did a good job?
Or possibly the Iraqi Army was trying to deceive Saddam into thinking he had WMDs, because Saddam wanted WMDs and it was easier to lie to him than to actually build the WMDs. IIRC, it was pretty clear in the aftermath of the war that the Iraqis generally had been running a deliberate and competent bluff on the WMD issue, but not clear whether Saddam personally was in on it – just about everybody involved had motive to lie about what they were up to.
But it does raise the issue of how to make predictions when someone who is likely to be as clever as you may be trying to deliberately mislead you on the subject.
My method for predicting whether we’d find WMDs served me well: “The UN inspector in-country says there aren’t any, so there aren’t. The US president who actually wants the war says his intelligence agency tells him they’re there, so they’re not.”
Unfortunately, the method doesn’t generalize very well.
Would that be the UN inspector the Iraqis refused access to certain sites and then threw out of the country?
As to the “there were no WMDs” trope, it’s only partially true. What is true is that we found the scattered remnants of old WMD programs which had been disrupted or discontinued in the mid-to-late-’90s. What Iraq did not have in ’03 was an ongoing development program that was actively producing new weapons. US forces found vast troves of hidden research, chemical artillery shells and buried technical machinery, just nothing current. The fear that Saddam was about to create a nuke or a smallpox bioweapon was not founded in fact, but every time I hear people say “there were no WMDs”, I am reminded how much of what people know just isn’t so.
I tend to assume such people just mean “there were no nukes.” The term WMD seems to have been deliberately designed to imply “nukes” so it’s not surprising that some people took it to mean that. (Chemical weapons don’t literally cause “mass destruction”; nukes do).
If we could taboo the term WMD and just say “nukes” when we’re talking about nukes or “chemical weapons” when we’re talking about chemical weapons, most of this confusion and miscommunication should disappear.
“WMD” was meant to refer to weapons whose effects cannot be confined to military targets and are almost certain to cause heavy civilian casualties even if used with the greatest caution. Canonically, nuclear explosives, war gasses (but not an assassin’s poison), live-agent or area-effect biological agents, area-effect radiological weapons (but not a bit of polonium in someone’s tea), and anything else along those lines that some clever mad scientist might come up with in the future. Antimatter bombs, redirected asteroids, gray-goo nanotech, etc.
This is a useful category to have for certain purposes. The Outer Space Treaty, for example, says that you can’t put any of these sort of things in space to threaten bombardment of the Earth, even if we do allow that the Space Cops will eventually need weapons of some sort. And prior to 2011, the Great Powers could with some credibility claim that if anyone anywhere used any of these weapons for anything short of World War III, we would end their regime, no questions asked (but it might take a few years before we got around to it).
But you’re right that the ambiguity allows propagandists to give the impression that someone they don’t like is about to nuke the Good Guys, when that is clearly not the case. For best results, always read “WMD” to mean “Mustard Gas” unless and until proven otherwise.
This is helpful, but it makes me wonder whether the term WMD, like so much in politics, wasn’t chosen intentionally for its vagueness. Like, if Pentagon advisers hear WMD and think “mustard gas,” the guy on the street probably hears WMD and thinks “nukes.” If we had said, however, that we must overthrow Saddam because he may have mustard gas, I don’t think people would have been able to get as worked up over that.
Mustard gas, sure, but also Sarin, or VX – there are some scary chemical weapons out there, and picking one that sounds much less scary seems like a strawman argument.
““WMD” was meant to refer to weapons whose effects cannot be confined to military targets and are almost certain to cause heavy civilian casualties even if used with the greatest caution. Canonically, nuclear explosives, war gasses (but not an assassin’s poison), live-agent or area-effect biological agents, area-effect radiological weapons (but not a bit of polonium in someone’s tea), and anything else along those lines that some clever mad scientist might come up with in the future.”
I can’t agree here; war gasses were most widely used in the First World War and I have never read a reference to them causing a single civilian casualty. Maybe they did cause more than zero, but I don’t see it as any more a necessary effect of them than of artillery shells, rifles, or bayonets, and I’d be willing to bet on very good odds that any of those have caused far more civilian casualties than gas.
“Weapon of Mass Destruction” is similar in meaning to “assault rifle”: more precisely stated, “weapon people find scary”. Generally people who know very little about weapons.
As such the claim that Saddam had “WMD” was always rather meaningless. Saddam had the ability to kill a lot of civilians, and did so. Saddam had the ability to attack his neighbours including Israel, and did so. Saddam didn’t have the ability to do these things against US opposition and win, which was also well known at the time.
So the idea that Saddam had to be stopped because he was developing dangerous weapons never really made any sense. He would’ve had to have been developing something like a large centrifuge program for producing nuclear bombs and ballistic missiles to have been a real threat, and he couldn’t do that, didn’t have the resources, and no one at the time alleged he was doing that or could do it. If it had been nuclear weapons, we would have remembered, “There are no nuclear weapons in Iraq”, not the much weaker, “There are no WMDs in Iraq”. Meanwhile Iran is publicly doing exactly that, and for all the bluster ultimately no one cares enough to do very much about it.
So the Iraq war wasn’t about weapons. It was probably about belief that
“This is helpful, but it makes me wonder whether the term WMD, like so much in politics, wasn’t chosen intentionally for its vagueness. Like, if Pentagon advisers hear WMD and think “mustard gas,” the guy on the street probably hears WMD and thinks “nukes.” If we had said, however, that we must overthrow Saddam because he may have mustard gas, I don’t think people would have been able to get as worked up over that.”
Yeah, except it is disarmament NGOs, not the Pentagon, who invented the term. They wanted to be able to advocate banning CS gas and have people think “Nukes!”.
“Weapon of Mass Destruction” is similar in meaning to “assault rifle”: more precisely stated, “weapon people find scary”.
“Assault rifle” is a legitimate term. It’s “assault weapon” that’s meaningless
There were many UN arms inspectors in-country, but I’m guessing this is a reference to Hans Blix, head of UNMOVIC in 2003. From his February 2003 report to the UN Security Council:
“To take an example, a document, which Iraq provided, suggested to us that some 1,000 tonnes of chemical agent were ‘unaccounted for’. One must not jump to the conclusion that they exist. However, that possibility is also not excluded. If they exist, they should be presented for destruction. If they do not exist, credible evidence to that effect should be presented.”
Much more along the same lines – Iraq may or may not have chemical weapons, but is hiding so much that we can’t know.
The other candidate would be Scott Ritter, UNSCOM 1991-1998 and later a prominent opponent of the war. As of August 1998, his public views were
“I think the danger right now is that without effective inspections, without effective monitoring, Iraq can in a very short period of time measured in months, reconstitute chemical and biological weapons, long-range ballistic missiles to deliver these weapons, and even certain aspects of their developing of nuclear weapons. program.”
He retired at the end of the month, meaning that by the start of the war Iraq had had five years to reconstitute a program that Ritter had said could be reconstituted in a very short period of time.
If there was a prominent UN inspector on the ground in Iraq who prior to the 2003 war affirmed that Iraq did not have chemical weapons, that’s news to me and I’d like a pointer.
How did you make the text glow red when I hover over it? Looks cool.
It appears that I closed out the first link with an <a> rather than a </a> in the HTML. I would have thought that this would have just turned the rest of the comment into a mega-link, but apparently it just makes it turn red on a mouseover.
Ah, well, I never claimed to be an HTML wizard. One step closer, though…
All the links on this blog glow red when you hover over them, or at least all the links I’ve checked.
Yes, Hans Blix.
Additionally: When I say “There were no WMDs,” I mean those exact words. There was no A, no B, no C and no R.
When you read “No WMDs” and you infer: “He doesn’t know about the empty gas shells, or the defunct factories” that is in fact extremely uncharitable. I know about the empty gas shells and the defunct factories. They have in common this: They’re not WMDs.
Where does the thousand tonnes of unaccounted-for mustard gas and nerve agent fit into your understanding? Because those are WMDs, about three orders of magnitude more than was used in the Ghouta attack and so capable of killing millions if used in a populated area.
Hans Blix’s exact words were quoted above, and “no A, no B, no C and no R” is so gross a misrepresentation of them that I am left to wonder about either your reading comprehension or your honesty. Likewise the suggestion that I had “inferred” anything about empty gas shells or defunct factories.
Schilling, I apologize. I conflated you with Sastan and his statement “every time I hear people say “there were no WMDs”, I am reminded how much of what people know just isn’t so”, which is what inflamed me.
“Would that be the UN inspector the Iraqis refused access to certain sites and then threw out of the country?”
Disallowing an inspector from viewing certain sites could mean one of two things: (1) you actually have WMDs that you’re trying to hide, or (2) you don’t want a foreign country to have detailed information on your military installations, since that would compromise your ability to defend your country from attack.
We know that in the case of Iraq, (2) was true. And since Iraq was invaded shortly thereafter, it seems the Iraqis’ concerns were justified. For that reason, I wouldn’t put much stock in the argument that a country that denies UN inspectors access to some places must have WMDs.
In fact, that seems to be the main motivation for sending UN inspectors to hostile nations. If they let the inspectors go where they please, you get detailed intelligence about their military facilities. If they don’t let the inspectors go where they please, you can insinuate to the public that they must have something sinister to hide, raising public support for military actions against them. It’s win-win from a hawk’s perspective.
It was close enough to “no WMDs” that people who say that are much less wrong than the prewar consensus was.
The claim that “… even independent-to-hostile investigators concluded that it had been correct in using the facts it had to believe Saddam had WMDs.” is a big red flag to me. As has been discussed, there was immense career pressure for professional analysts to come up with results that supported the Iraq war. Afterwards, when no WMD’s were found, that creates a small credibility problem. The solution is CYA, sometimes now jokingly called “Who could have known?” (meaning ironically, we knew we were writing total nonsense that our bosses wanted to hear, but publicizing the truth would have been bad for us at the time). Thus a cottage industry of coming up with “explanations” as to why the original deliberately wrong prediction was allegedly reasonable given the contemporary political incentives, I mean, information available.
A gee-whiz narrative of “These weird tricks allow amateurs to beat professionals!” sells a lot better than a dour “People in power want professionals to confirm what the powers-that-be have decided to do anyway”.
If you don’t believe the independent-to-hostile investigators, perhaps you should participate in your own independent-to-hostile investigation… then you’ll find that people on the internet won’t believe you when you say you’ve actually investigated something, because after all, they’re hedgehogs.
But why should people on the Internet believe me? Isn’t that the whole problem in the first place? If people unquestionably believe some random blatherer in a blog’s comment section, we’re in trouble.
How could you tell if I actually did a serious investigation, as opposed to an exercise in motivated reasoning to support some political faction? (obviously this applies to all my comments). This is why there’s a very hard problem in predictions, and about thinking in general.
It’s not your random blog comment. It’s that when you go through a serious investigation and publish your results, people should probably pay attention, perhaps actually read the reports, and update their estimates… rather than just rely on a vague politicized notion that professionals are bullied into lying.
Can someone link to the independent-to-hostile investigators? I’m suspicious as much due to the absence of a citation as due to the reasons given by Seth.
Tetlock cites Why Intelligence Fails by Robert Jervis.
I’d expect a strong sarcastic “who could have known?!” narrative among analysts whether or not it had really happened that way.
That narrative seems like basic psychological CYA: the source of the expensively overconfident prediction was not, after all, the analysts themselves. It was their political bosses. The analysts knew the real story all along, but weren’t allowed to say so.
Maybe that’s true and maybe it’s not; I don’t think the presence of the narrative gives much evidence either way.
As has been discussed, there was immense career pressure for professional analysts to come up with results that supported the Iraq war.
Immense career pressure in the CIA, yes, and in MI6 I would assume. But there were also professional analysts working for the UN, for generally anti-war nations like France and Germany, and for the private sector. I followed them all at the time; WMD issues were and are a professional interest of mine.
Assessments ran from “Iraq certainly has chemical weapons”, through “Iraq probably has chemical weapons”, to “Iraq might have chemical weapons but they’re being too deliberately and effectively secretive for us to know”. Nobody was saying “Iraq does not have chemical weapons” or “Iraq probably does not have chemical weapons”. And unfortunately nobody was putting percentages on their predictions at the time, but translating from colloquial English gives a range of maybe 50-95% depending on the analyst. The high end was dominated by the people whose bosses wanted support for a war, obviously.
Nuclear weapons, the smart money was always on “no”, and in the aftermath some of the people who had been saying “Iraq doesn’t have nuclear weapons, mumble mumble chemical mumble biological”, essentially retconned their positions to “I said Iraq never had WMD and I was right!”.
How about Scott Ritter? He didn’t say that Iraq had zero chemical weapons, but he did say that they definitely had very few.
Quoted elsewhere in thread. Scott Ritter said in 1998 that Iraq had very few chemical weapons that anyone knew of, but could rapidly reconstitute the program in a very short time. Then he resigned in protest that the UN wasn’t doing anything about it.
His statements in 2002-2003 were ambiguous at best, always opposing the invasion but sometimes because Iraq did not possess WMD and sometimes because US troops would be walking into a hellish nightmare of nerve gas as they approached Baghdad. But by this time, he was five years past being able to speak as an insider on the subject.
What, outsiders are “nobody”? Moving the goalposts.
And the fact that he was more accurate as an outsider is pretty damning.
If we are talking about “immense career pressure”, then we are by definition talking about insiders. No goalposts moved there.
If we are talking about “more accurate”, then this isn’t it:
Now chemical weapons is different. As I testified to the U.S. Senate in 1998, Iraq has the indigenous capability right now to reconstitute a chemical weapons program within a matter of weeks. And my concern is if we continue to push for military action against Iraq, and once the writing becomes clear on the wall — and believe me, if Saddam Hussein doesn’t understand that President Bush is dead serious about going to war against him now, I don’t know when he’ll be — when he’ll recognize that. But at some point, I believe that Iraq will seek to reconstitute militarized nerve agent that will be used in defense of Baghdad. And I think the Iraqi government’s efforts to acquire significant stockpiles of atropine are an indication that this is the direction that Saddam Hussein is heading.
If you make enough different predictions, and count on your fans to forget all the wrong ones, you’ll wind up with a reputation at least as sound as Nostradamus. You can even manage that if you constrain your predictions to variations on George W. Bush Is Wrong About Everything.
Yes, Ritter’s views of Iraq’s goals and capabilities were overestimates. But there is almost no contradiction between the two accounts. Nowhere in your link does he claim that Iraq had stockpiles of chemical weapons, either from long ago or from an ongoing program. The only difference is that in my link he says that they could restart in six months, while in your link he says “weeks.” My link is talking about starting from scratch, while if he really means weeks, he is probably talking about stockpiles of precursors or mothballed factories. “Reconstitute militarized nerve agent” sounds like he’s talking about precursors, but I wouldn’t read too much into an oral interview; he might have just meant “reconstitute the program.”
Now, maybe you’ll insist that precursors count as chemical weapons. There are a lot of situations in which that is reasonable. But you were insisting on narrow usage. In this context, I think the most reasonable interpretation is the one that directly addresses the administration’s claim: the ongoing production of new chemical weapons.
Here’s Gregory Cochran’s analysis of why Saddam couldn’t afford a nuclear weapon program that Jerry Pournelle posted on his blog on 10/14/2002:
“As far as I can tell, exactly nothing new has happened in Iraq concerning nukes. Most likely they are getting steadily farther away from having a nuclear weapon. Look, back in 1990, they surprised people with their calutrons. No normal country would have made such an effort, because calutrons – mass spectrometers – are an incredibly inefficient way of making a nuclear weapon. We know just how inefficient they are, because E. O. Lawrence conned the government into blowing about a quarter of the Manhattan Project budget on a similar effort. Concentrating enough U-235 for one small fission bomb cost hundreds of millions of 1944 dollars. Probably the Japanese could have constructed new cities for less money than this approach took to blow them up. By far the cheaper way is to enrich the uranium just enough to run a reactor and then breed plutonium. The Iraqis wanted U-235, probably because it is much easier to make a device with U-235 than with plutonium. You don’t have to use implosion and you don’t even have to test a gun-type bomb – we didn’t test the Hiroshima bomb. I would guess that they realized their limitations – they’re not exactly overflowing with good physicists and engineers – and chose an approach that they could actually have made work. Implosion is not so easy to make work. India only got their implosion bomb to work on the seventh try, back in 1974, and they have a _hell_ of a lot more technical talent than Iraq.
“Anyhow, Iraq doesn’t have the money to do it anymore (1). The total money going into his government is what, a fifth of what it used to be? (Jeez, quite a bit less than that, when you look carefully.) Big non-private organizations tend to gradually slide towards zero output when the money merely stays the same: cut it and they fire the worker bees and keep a few Powerpoint specialists. There is no reason to think that Arabs are immune to that kind of logic of bureaucracy. On the contrary. Not only are they not making any nuclear progress, they’re probably making regress.
“At best, if we hadn’t interrupted them back in the Gulf War, they would have eventually had a couple. I doubt they even would have been an effective deterrent. It’s hard to make classic deterrence work when you have one or two bombs and the other guy has thousands, when he can hit you and you can’t hit him.
“He would cause himself practical trouble by harboring anti-US terrorists. If they ever made a significant hit on the US, he’d be in deep shit. What would he get out of it? And I am supposed to think that he fears terrorist groups more than he fears a Trident boat?? He should appease _them_, rather than us? Look, if we really got mad, we could turn him and his entire nation into something that was no longer human. Kill them too, of course, but that’s too easy.
This particular argument is nonsense, even if he had a little deterrent, as are all the ones that I have seen floated by the Administration or by their hangers-on and flacks. It’s not as crazy as the idea that we’re going to democratize Iraq, or Iraq and then the entire Arab world – that’s about as crazy as a human can get – but it makes no sense. Anyone with a brain knows, for example, that the last thing Israel wants is democratic Arab states, because they’d be _more_ hostile than the existing governments, and possibly stronger. People like Mubarak understand that they can’t beat the IDF, and also understand who makes the deposits in their Swiss accounts: a new popular government might not. And a popular government might have some enthusiasm to draw on – Iran did, at first, after the fall of the Shah – whereas in places like Syria or Iraq > 70% of the population hates the government.
I know why Wolfowitz wants this, and why Bill Kristol wants this. I know that most Americans have decided that Iraq was somehow responsible for 9-11, because what else would explain the Administration’s desire to attack? And so they support an attack, which would make every kind of sense if Iraq _had_ been behind 9-11. Except that everyone knows that they didn’t have anything to do with it. The problem is, I don’t understand, even slightly, why Bush and Cheney want this.
Gregory Cochran
Almost all the oil sales (other than truck smuggling) go through the UN. ^8% of that revenue is available for buying _approved_ imports, mainly food and other things that we approve of. The US has a veto on such purchases. The total amount available for those approved purchases was something like 7 billion last year. Saddam is getting under-the-table payments of something like 20 cents a barrel from some, or for all I know all, of the buyers: but how much cash is that? We’re talking something like 1 or 2% – no more than 100 million a year. Sheesh. Probably the truck smuggling accounts for more. Hmm. That might be as much as a billion. Not much cash to run a government. It’s a little hard for me to see how he manages to keep the show on the road at all.
The problem there is “the facts they had”. The circularity of “the administration is looking for evidence to justify going to war”, the selection pressure that put on which data were gathered and how they were analysed, and that analysis then being used to support “the evidence is clear, we should attack now or else!” policy decisions does not seem reliable to me.
The David Kelly affair in Britain is a troubling example; here was an MoD expert who had a contrary opinion of the interpretation of the “facts they had”. Whether he was correct in his view or not, it was undeniable that all the political pressure was “find us reasons to support the Americans in their invasion plans”. The Hutton Report, after the inquiry into his suicide, though generally favourable to the government and finding the BBC and the journalist culpable of false reporting, said amongst other things (bolding mine):
If you’re wondering who Alastair Campbell is, he’s the person on whom Malcolm Tucker was based (warning: coarse language, and yes, allegedly Mr Campbell was that charming and soft-spoken in real life).
See also Colin Powell’s lies and some redundant evidence that the CIA would try to make their past judgments look reasonable regardless of the truth.
From elsewhere on the site:
On granularity, I think the key point is related to another finding, which is that the best forecasters are the most active ones, who repeatedly return to and revise their predictions in light of new evidence. The very best forecasters do this in a relatively Bayesian way, with small updates rather than big overreactions to (most) news. At the beginning of the year you might make a pretty good forecast (20%) that candidate X will win the republican nomination. Then candidate X does however well in the Iowa caucuses and then has a mediocre debate performance. You should update on this info, but probably only a little. People who practice this for a while seem to end up making pretty small updates in most cases that cause them to use the most detailed level of granularity permitted by the interface. People who are more casual about updating and who do so more rarely are (a) less likely to be paying as much attention as supers, and (b) more likely to lazily move from 20% to 30% or 10% than to carefully apply Bayes’ rule. So I think the granularity effect is probably true, and also unsurprising.
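To make the “small update” point concrete, here’s a minimal sketch with made-up numbers: a 20% prior on candidate X and an assumed likelihood ratio of 1.3 for the mixed caucus/debate news. Both figures are purely illustrative, not from the book.

```python
# Small Bayesian update with illustrative numbers (assumptions, not data).
prior = 0.20                 # start-of-year estimate that candidate X wins the nomination
likelihood_ratio = 1.3       # assumed: the news is ~1.3x more likely if X is on track

prior_odds = prior / (1 - prior)
posterior_odds = prior_odds * likelihood_ratio
posterior = posterior_odds / (1 + posterior_odds)

print(round(posterior, 3))   # ~0.245: a nudge from 20% to ~24%, not a jump to 30%
```

Under most realistic likelihood ratios the move is only a few percentage points, which is exactly the kind of adjustment that needs fine granularity to record.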
Agreed. As a prediction market trader, I was unsurprised by the granularity finding, because I notice that many top forecasters will make small price movements with relatively high frequency/volume, whereas many novice forecasters will make big trades infrequently. In general, if you’re a higher frequency forecaster, your adjustments will tend to be smaller.
(There are other small reasons I believe it too, but I won’t type them out now.)
I would second that. Because markets trade to the level of granularity allowed, every trade will be evaluated by a successful trader in terms of “is 42 cents a good buy” rather than “is the probability of this occurring somewhere around 40 percent.” I think that simple change would make the overall finding less surprising.
On a side note, even if Scott feels that the book is “not too useful” for people like him, I would personally predict an 83% chance that Scott’s next iteration of prediction-calibrations in 2017 will take the granularity insight into account instead of his broad categories of 60%-70%-80% etc.
If that one thing improves your prediction ability significantly, I’d say the book was worth more than a high-status official symbol–or at least as much as the average seminar.
But the adversarial situation on a prediction market is completely different from forecasting your own predictions. Making a small deviation from the consensus is very different from making a precise absolute forecast. Also, with a limited budget, you have no reason to move the market price anywhere near your own beliefs. Both to save the budget for other questions and to save it for the same question on other days.
Also, I think the finding is that p(high freq | super) is much higher than p(super | high freq).
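A toy calculation (the rates are mine, not the study’s) shows why that asymmetry is almost automatic when supers are rare:

```python
# Assumed illustrative rates: supers are 2% of forecasters, 90% of supers update
# frequently, and 30% of everyone else does.
p_super = 0.02
p_hf_given_super = 0.90
p_hf_given_other = 0.30

p_hf = p_hf_given_super * p_super + p_hf_given_other * (1 - p_super)
p_super_given_hf = p_hf_given_super * p_super / p_hf

print(round(p_super_given_hf, 3))  # ~0.058: most frequent updaters still aren't supers
```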
Good points.
They improved when they did that, but they were already winning in the first year, when they didn’t know who the best people were.
I wonder if the “imagine my marriage has failed and tell me why” thing isn’t trying to predict a failed marriage at all, but is rather trying to harness the power of self-negating prophecy. If you can know ahead of time why the thing you’re doing is going to fail, one response is to not do the thing, but another response is to do the thing differently in such a manner as to address the cause of failure.
Yes; I believe in business this is known as a “pre-mortem”.
That term comes from the rationality literature, not vice versa, IIRC. But see the citations in Kahneman’s Thinking, Fast and Slow, where he discusses it.
I believe the prior, more technical term is “prospective hindsight”.
Generally when people are at the point of getting married, they think they can make it work. I don’t know if advice about “this is a bad idea” ever works; even if you say “Well, George, you know you have a quick temper and Sally’s habit of leaving the top off the toothpaste is going to drive you scatty”, George thinks (a) I don’t have a temper! (b) okay, I can fix this simple personality quirk.
Doesn’t always end that way.
Only experience I had was a cousin of mine rushed into an unsuitable marriage; everyone in the family drew in their breath and went “No!” when he announced his intentions, but he wouldn’t listen and went ahead anyway and yep, ended up divorced in a couple of years.
Might George think “I can’t fix this, but I’ll try to make sure it doesn’t escalate into something me and Sally can’t handle, and make sure Sally knows I still love her even if I get upset about the leaving the top off the toothpaste – that is, it’s really just about my personality quirk, not an indication I don’t love her.” That might not succeed all the time, but it seems it worth a try going into the marriage.
That is, the idea being not that you avoid the problems entirely, but that knowing where they are would help keep them as small problems, rather than letting them cascade into big problems.
In my experience, romantic relationships are one of the areas most impervious to outside advice ever.
Yep, few will listen at that stage. Which is a good reason for the new custom of starting with overnight dates, escalating to weekends, to vacations together, to shacking up. Then they’ll seek a relationship counselor and bring a list of known problems.
I find few will listen at any stage of romance–especially not the early stages! People are far more inclined to take to heart friends’ and family’s concerns about their boring old spouse of ten years than they are about the sexy new person they just started banging. In fact, the length of time the relationship has been going on probably correlates to one’s ability to evaluate it honestly and dispassionately, though it probably also correlates at least weakly with actual compatibility.
Is there a reason it skips from #8 to #10? Is there supposed to be a #9? (If there is no #9, then there are 10 total, but we are left with the question of what happened to #9. If there is a #9, there are 11 total, but this is OK because #11 is a meta-commandment and presumably doesn’t count.)
Nevermind, #9 is buried in the middle of #8; the person just forgot to break the paragraph.
The passage about the history of medicine seems to lack nuance, mostly by ignoring the complexity of the progression between Galen and evidence-based medicine.
James Hannam (PhD in history of science) argues that ancient medicine could be divided into two domains: medicine proper (diagnosis and prescription) and surgery. Surgery, as a manual, artisanal activity, was considered a lowly activity by the ancient Greeks. It was thus practiced by lower-class artisans, was transmitted orally to apprentices, and was directly based on practice and observation. As such, it was constantly progressing, and by the middle ages, surgeons knew how to properly clean and sterilize a wound, and they could perform advanced surgical operations, like rhinoplasty.
On the other hand, medicine was a high art, taught in universities from the writings of respected scholars, and mostly fueled by the theoretical reflections of physicians. As such, it was largely ineffective: diseases were diagnosed based on the theory of humors, and remedies were based on analogies of shape (walnuts were prescribed against headaches because walnuts look like brains). This significantly impaired any progress in medicine for most of the premodern period, especially as, throughout the middle ages and the renaissance, what had been preserved of Greek writing remained the main source of scholarly knowledge in the universities.
Only slowly and painfully was it realised that the Ancients were not always right (the first dent was of course in the realm of theology — the Church regarded Aristotle as one of the greatest of all scholars, and yet still held him as Clearly Wrong on matters of cosmology and eschatology). This progressively affected medical knowledge; for a long time, it was deemed useless to perform human dissections, because Galen had provided detailed descriptions of the human body which implied he had already done this work. But when dissections started to be made regularly anyway — originally for forensic purposes — it was realised that Galen’s descriptions were badly inaccurate; it turned out Galen had never dissected a human body: he had dissected a number of animals and extrapolated what the contents of the human body might look like from there.
It became progressively clear (in medicine and other domains) that the Greek method of accessing knowledge through pure reasoning was wholly inadequate and that observation, experiment and measurement were needed — though the lack of funding and of means meant that such endeavours remained only wishful thinking for a long time.
Eventually though, the beginnings of germ theory started to appear in the 16th century; several experiments, along with the invention of the microscope in the 17th century and the development of primitive vaccination techniques in the 18th century, paved the way to modern medicine, although diagnostics and treatments didn’t substantially improve until the middle of the 19th century — but from that point on the changes were spectacular.
Not that it is necessarily relevant to your medieval claims, but you, and probably James Hannam, don’t actually know anything about ancient Greek science.
Care to share what you know?
Tetlock seems to ignore things like urine examination; this wasn’t merely “diagnosis based on doctors’ personal whims”, it was an attempt at exactly the kind of evidence-based medicine he praises. Luckily, we are now past the point where the doctor has to taste the urine to find if it’s sweet and so diagnose diabetes.
But the modern quick urine dip-stick test is not too dissimilar to the uroscopy wheel, and even if you knew in the 16th century that your patient had diabetes, there wasn’t too damn much you could do about it. Progress in medicine depends on progress in general; until you can identify that insulin is involved, that the patient needs insulin, that you can safely and reproducibly obtain or synthesise insulin, work out dosages, and have a delivery system (i.e. hypodermics or ports) to deliver that insulin, an accurate diagnosis of diabetes will still not permit you to have an improved rate of patient survival.
And nowadays it seems we’re moving more to Galen’s view, rather than the “one size fits all” view of treatment; some conditions are more prevalent in certain ethnicities, some drugs will have different effects depending on age or gender or other attribute, and as we’ve discussed on here before, the anti-depressant that works wonders for Bill may do nothing for Tom and make Susie even worse. Saying “SuperPep-U-Up works in most cases but there are some for whom it does not work so switch them to a different medication” is not Galen trying to eat his cake and have it, it’s the result of observation. I don’t think any study finds “Our new drug cured 100% of the test group and had no side-effects”, it will always be “Effective in 90% of cases but some people get the very rare side effect of their liver explodes and they die”.
I skimmed Hannam’s book, looking for discussion of the ancients. He is actually pretty careful about not talking about Greek thought, because it’s not relevant to his medieval topic.
However, this talk is explicitly about ancient Greek vs medieval Christian science. I have a huge number of complaints about it. It contains many false claims about the Greeks. But all those pale to nothing before Machine Interface’s
whereas, Hannam:
So Hannam says that Greek science was all about observation, not pure thought. And diagnosis is one of the successes of Greek science, almost opposite to MI’s memory of Hannam.
So that’s Hannam. I would go further and say that the Hellenistic Greeks were far more enthusiastic about experiment than medieval Christians. Many of my examples are things that he does mention. So while he makes many factual errors, I am much more concerned that he applies double standards for this and many other comparisons.
Experiment is such a vague concept that it is not productive to argue about. But Hannam made the much more concrete claim that this putative deficiency of Greek science made it incapable of discovering useful interventions. A great counterexample to this is the work of Aristotle’s successor, Theophrastus, on plants. Maybe Aristotle’s work on animals was pure observation and pure knowledge of no practical value, but Theophrastus’s books contained many practical interventions which quickly spread over the Mediterranean. This is not physics and engineering, but something as messy as medicine. Medical treatment is a definite failure of Greek (and medieval!) science, but it’s just one example.
Actually, I found another talk where Hannam says something like what Machine Interface attributes to him:
But, as the other talk demonstrates, this is not out of ignorance, but out of malice.
Oy, with the whig history. The main factor which retarded the progress of medicine was evidence, specifically, that the totality of medical evidence collected prior to 1860 did nothing to support the germ theory of disease over its rivals. To illustrate this, consider a selection of diseases known at the time and their vectors:
–Typhus (bacterium in the fecal matter of the body louse)
–Malaria (protozoan parasite in the bite of the anopheles mosquito)
–Smallpox (airborne virus in the bodily fluids of infected humans)
–Syphilis (bacterium spread by sexual contact and gestation)
–Tuberculosis (airborne bacterium in the bodily fluids of infected humans, milk of infected cows)
–Rabies (virus in the bite of deranged mammals)
–Plague (bacterium in the bite of rat fleas)
–Cholera (bacterium spread by food or water contaminated with fecal matter of infected humans)
What on earth would clue you in to the fact that all of these diseases, with their diverse mechanisms of transmission, share the same cause? Why should it be microscopic pathogens, rather than filth and foul airs, as was commonly believed? What set of experiments, with no microscope capable of detecting viruses, could possibly confirm your hunch? The connection between poor sanitation and disease, in contrast, is striking, easily demonstrated, and does not require us to believe in invisible monsters. So we should dispense with the fiction that premodern physicians failed to make progress in medicine because they were incompetent, or beholden to phony wisdom of the ancients, or hostile to experiment and observation. They failed because nature guards her secrets well.
Comment of the week.
Might this suggest that, in general, people should be working on better tools?
While I have no idea what might be as big a breakthrough as microscopes, I’d love to see fast cheap quantitative chemical analysis.
This article (http://www.nature.com/news/the-top-100-papers-1.16224) implies yes. Apparently the majority of the most cited papers ever (arguably the ones that are the most influential) are describing tools or techniques. (Although this might underrepresent breakthroughs in theory, which tend to get incorporated into textbooks and no one ever reads the original paper.)
Some of it is nature guarding her secrets well, sure.
But some of it is, in fact, physicians killing their patients because they were beholden to the phony wisdom of the ancients and nobody trying even the simplest observational tests. Thus we had centuries of treating fevers through bloodletting (like Galen recommended), though a simple comparison of the dramatically different survival rates between those bled and those not bled would have revealed that it killed patients.
Something to point out in this narrative: high-status people in the past are bad, low-status people in the past are good. Notice how the commenter drops into the passive voice when, presumably, the high-status people become more correct.
Edit: spelling
The passive and active voices were used alternatingly throughout the comment.
First sentence/paragraph: active
Second paragraph: active, passive, passive, passive, passive, passive, active, active, active.
Third paragraph: active (stative verb), passive, passive, active (stative verb), passive, passive, passive, active, active, passive, active.
Fourth paragraph: passive, active (stative verb), active (stative verb), active, active, active, passive, active, active, active, passive, passive, active (stative verb), active, active, active, active, active.
Fifth paragraph: active (stative verb), active (stative verb), passive, active, active (stative verb).
Sixth paragraph: active, active, active, active (stative verb).
Verbs in the active voice: 32 (including 9 stative verbs)
Verbs in the passive voice: 16
Use of the passive voice by paragraph:
First: 0
Second: ≈0.56
Third: ≈0.55
Fourth: ≈0.22
Fifth: 0.2
Sixth: 0
Thus, directly opposite to the above claim, the passive voice is used mainly at the beginning of the comments rather than making an appearance at the end, and is used to refer to both upper and lower class people throughout the text.
The naive number crunching is fun, but without context, meaningless. But I suppose throwing math at something gives you the appearance of being correct? Your narrative is still high-status bad, low-status good.
“Passive voice” is not an arcane notion in English; the claim that the original comment used passive voice in specific places for motivated reasons can be evaluated objectively.
In English, the passive voice is formed by verb “to be” followed by a past participle; it indicates that the subject of the verb is the patient rather than the agent of the action.
This is a passive voice: “The mouse was eaten”. The mouse is the subject, but something is done to the mouse, rather than the mouse doing something.
These are not passive voice, but are often mistaken for it:
The verb “to be” followed by a present participle (“the mouse was eating”).
The verb “to be” followed by a non-verbal adjective (“the mouse was beautiful”).
The use of an inanimate subject with a verb normally associated with an animate one (“the doors open at 8”).
Impersonal constructions (“it rains”; “it seems that he was wrong”).
The number crunching in the previous post goes through every instance of a conjugated verb in the original post to note if they are either in the active or passive voice, and is thus easy to verify with both comments under the eye, provided the meaning of “passive voice” in English grammatical theory is correctly understood.
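For what it’s worth, the “to be + past participle” rule is mechanical enough that a crude script can approximate the hand count. This is my own heuristic sketch, not anything from the thread, and it deliberately fails on the tricky cases listed above (irregular participles, stative adjectives like “was beautiful”), which is why a careful count still has to be done by hand.

```python
import re

# Crude passive-voice heuristic: a form of "to be" followed by a word ending in
# -ed/-en. It misses irregular participles and can over-flag some adjectives.
BE = r"\b(?:am|is|are|was|were|be|been|being)\b"
PASSIVE = re.compile(BE + r"\s+\w+(?:ed|en)\b", re.IGNORECASE)

for s in ["The mouse was eaten.", "The mouse was eating.", "The doors open at 8."]:
    print(s, "->", bool(PASSIVE.search(s)))
# Only "The mouse was eaten." is flagged.
```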
“The number crunching in the previous post goes through every instance of a conjugated verb in the original post to note if they are either in the active or passive voice, and is thus easy to verify with both comments under the eye, provided the meaning of “passive voice” in English grammatical theory is correctly understood.”
Still not really getting it, and this is the part where I point out that of all the replies to your comment, you’re still stuck on mine. Actually, only the second sentence of mine. And even then, you are still focusing on naively “mathing” one argument, without context.
Disagreement over historical interpretation is hard to settle, and the usefulness of a discussion doesn’t seem obvious when the opposing party starts by declaring, without further elaboration, that the cited expert knows nothing.
On the other hand, accusations of class bias based on a faulty grammatical analysis can be easily disproven.
You still don’t understand what I’m claiming. You really want it to be a claim about the ratio of active and passive verbs in each paragraph. You want this to be the entirety of my claim. That is not what I claimed. Blindly throwing math at something is a habit you have to train undergrads out of, and getting out of it is a good habit to have in general. But at this point you’re failing a Turing test, so this is written for the general audience; I’m just going to assume you don’t get it right now.
If the use of passive voice isn’t important, then it should not have been used as a supporting argument which turned out to be wrong.
Having the charity to ignore that glaring mistake (along with the refusal to acknowledge it and the insidious personal attacks when that refusal is highlighted) doesn’t change much: the accusation of general class bias ignores that this was an argument specifically about medicine, and not about all fields of knowledge and competence in general.
Historical evidence is that in the pre-modern period, physicians were at best useless and at worst actually worsened their patients’ chances, whereas surgeons were competent and efficient. This is a particular case about a particular field and doesn’t say anything about the status of the knowledge of upper and lower class people in other fields.
@ Machine_Interface
The use of an inanimate subject with a verb normally associated with an animate one (“the doors open at 8”).
Good addition to a good list.
Keep in mind that this was because of very *strong* Greek cultural taboos about the treatment of dead bodies. Remember how Achilles insults Hector by dragging his body around the city? That was a *big deal.* Dissection of human cadavers was not something you could just *do.*
It wasn’t until Egyptian philosophers in Alexandria (who had no such cultural issues) started experimenting (usually on the bodies of the condemned… *usually* after the sentence had been carried out) that we started revising Aristotle’s theory. This happened well before the middle ages, by the way, and is when we moved the center of consciousness from the heart to the brain.
Scott, you should look up a bit about Galen. His joint contributions to philosophy and medicine sound right up your alley.
The sudden jump to Galen at the end is a bit odd since your first two paragraphs are about people hundreds of years before. I guess Galen is the main source for that history.
Almost everyone had taboos on dissection. The Greeks weren’t special in this. Given the Egyptian funerary rites, it wouldn’t be surprising if they had weaker taboos…but Alexandria wasn’t culturally Egyptian. The anatomists of your first paragraph, Herophilos and Erasistratus, were born in Greece, as were the rulers of Egypt. Maybe local norms would make a difference, but Alexandria was a new city specifically for Greeks, so probably not.
By the time of Galen, dissections were not allowed in Egypt. Maybe that was Greek or Roman cultural dominance, but it’s pretty standard and not in need of explanation. Galen “did his homework” reading descriptions of earlier dissections, but he still got a lot of stuff wrong. So even if he had done dissections himself, the medieval physicians reading him probably would have made a lot of mistakes, too. Also, I think a lot of his mistakes could have been corrected by the dissections of pigs that he did do. He wasn’t a scientist learning from his experiments, but I don’t think he claimed to be, only to be passing on the wisdom of his predecessors. It would have been better if the writings of those predecessors had survived instead of his, but that probably wasn’t an option. His celebrity preserved his writing but didn’t suppress his sources.
“Center of consciousness” sounds a little anachronistic to me. I think a popular view in classical Athens was that the intellect came from the brain and the emotions from the heart, to identify more concrete functions. Erasistratus’s argument that the heart was a pump, and thus probably wasn’t anything else, is a good one. And maybe his study of the nerves gave him good reasons to believe things about the brain, but it was a popular choice before him, perhaps for bad reasons.
Maybe the secret sauce was getting 2,800 people to make predictions somehow. Did other groups gather this number of people?
If you’re collecting a list of people who would use that program if released (to send to Lorch), put me on it.
+1 if we’re building lists.
How much should it help to add the name “Anonymous” to such a list?
I believe Scott has access to the emails which are attached (I comment using the same email regularly; also, I imagine this is the mechanism by which he bans users). Also, I imagine pure numbers are probably more important than specific routes to individuals for a “get them to release the code” request. Finally, it’s possible that the combox requests result in a more formalized process for collecting a large number of requests.
If I’m wrong, of course, then I apologize for the (now two) useless comments.
Me too.
The part in the response about it being worth his time sounds like a commitment to pay X dollars if it were released, which might be more useful than just having a list of people who’d like it.
From the tumblr post, it sounds as though chances are low we’ll see the news-selection tool anytime soon.
Is there interest in that kind of tool generally, or is the value in the subjective bias scores, and sources from Lorch himself? If the former, we could probably put something together ourselves.
The value is probably in the subjective bias; if you’re ranking on ideological basis, that’s a very subjective judgement: what you consider a conservative versus a right-wing versus a foaming at the mouth extremist far-right (or vice versa) news source depends on your own position. To a very liberal person, what someone might consider “conservative but not extremist” could seem like “and in their next editorial they will call for sending LGBT+ people to re-education camps” (and again, vice versa).
So reading sources you ordinarily wouldn’t even use as fish-wrappers may tell you something you didn’t know, as well as the geographical spread. Again, reading Big City Coastal paper will give you the Big City Coastal view of the rednecks and rubes and the inconvenience they put the reporter to, blocking the traffic as he tried to get the Luas in to work and wasn’t able to stop at his favourite boutique cafĂ© for his morning artisanal hand-ground beans espresso, but The Farmers’ Journal will tell you why the dung-smelling protestors on tractors are parked in front of the Department of Agriculture, in plenty of detail related to the CAP price support mechanisms that the big city readers don’t know or care about.
Yes, the whole idea is to gain ideological diversity – there’s not much left if you take that out. I see that wasn’t quite as clear as I thought in my initial post.
What I’m curious about is whether the SSC crowd is specifically interested in a superforecaster’s source list or the general ability to ensure a diverse news/information intake. If it’s the former then we’ll need to specifically recruit someone interesting (like Elissa), if the latter then I can just put something together on github (though of course any and all interesting people would be welcome to contribute).
It is true that in general placing a publication in the space of possible ideological biases is difficult, and would be one of the issues a project like this would have to tackle, but it’s not insurmountable.
Certainly the source list would be interesting; does Tetlock mean “Douglas Lorch used to only read The Mid-Western Gazette but when he included sources he previously thought were the Devil’s mouth-organ he found this improved his knowledge of the world” or does he simply mean “He read other media sources besides the New York Times for the first time in his life”?
I have to admit, though, my view is coloured by the reaction that phrases like “using criteria that maximise diversity” evokes in me; it generally is used de haut en bas to tell the rednecks they should try expanding their limited view of the world beyond “The KKK Newsletter”, and if it ever is used in the context of the high-minded and right-thinking opening a book or paper that isn’t of their tribe, it’s in an anthropological tone of “With Gun And Rod Along The Suwannee: My Six Months Amongst The Republican-voters” and ends with the columnist or author gratefully fleeing back to the oasis of sanity and reality that is the Culture and Style Section of the NYT or the LA Times 🙂
Actually, given the kinds of predictions they were apparently making, my uninformed speculation would be that he had a lot of sources that he just didn’t care about before.
For instance, if he was doing a bunch of predictions about various African elections, he probably wouldn’t find much about it in his ingroup or outgroup news sites, he’d have to go off and collect what sources he could from around the continent, making sure they’re not all run by american expats or communists or whatever.
If I recall correctly, there is a very Nassim Taleb style piece of advice that you should abandon all news/information intake sources and read the classics, which ended up working out a lot better for me personally. Of course, I’ve started reading the news again now, but mostly for fun (there was a moment of amusement when I realized I read no establishment news sources for either the left or right politics, only radical sources).
I’m glad you picked up Tetlock on the bit about medicine 🙂 It really is phrased horribly and makes it sound as though in the 12th century there was really advanced science everyone else was performing correctly but the ~~witchdoctors~~ medical profession were going “Yes, I know the bacterial culture demonstrates that this illness is caused by a staphylococcal infection, but I prefer to believe that it was caused by the patient wearing too much blue in his clothing instead of balancing it out with stimulating red, so dammit, that’s how I’m going to treat it!”

I haven’t read this book, and frankly the excerpts aren’t making me want to rush out and get it – okay, so you had better outcomes to the tune of 60% and 78%? On what figures – 10 forecasts, 100 forecasts, 1,000 forecasts?
Otherwise, it sounds as if throwing darts at a newspaper would have been the better strategy for the competing groups, if your group was getting so much better results. It also sounds as if Tetlock is indulging in a little of the behaviour he finger-wags at in others:
“So I had this notion that if everyone knew everything, they’d be a lot more confident in saying “definitely” or “no chance”, and so I told my forecasters to push their estimates up higher or lower than they originally picked”. Isn’t that a tiny bit “hedgehog” of him? The One Big Thing he knows is about “the wisdom of crowds” so he applies his own experience and opinions to telling his forecasters how to ‘tweak’ their calls?
discovered that superforecasters lost accuracy in response to even the smallest-scale rounding, to the nearest 0.05, whereas regular forecasters lost little even from rounding four times as large, to the nearest 0.2.
This was the part nobody on the comments to the last post believed, and I have trouble believing it too.
No, that makes sense. If your superforecaster is giving a probability of 21% to “Will Martians land on the White House lawn in 2020?” and your regular old flawed forecaster is giving “30%” and your slightly better forecaster is giving “25%”, then rounding up makes the superforecaster and the slightly better forecaster both “25%”, and rounding up to the tens makes them all “30%”.
If we get to 2020 and no Martians, we can say “Dumb old flawed forecaster! They should have gone for 5%!” but the superforecaster’s original estimate would have been less wrong (heh) and so treated as more correct at 21% – “sure they got it wrong too, but not as badly wrong, they were 9% or whatever more accurate in their estimation”. So if you’re ranking “Who was the best?”, the superforecaster at 21% comes out top of the table. But rounding them all up to 30% makes them equally dumb in estimating the chances, and badly knocks the superforecaster down while it only slightly hurts the less bad forecaster and doesn’t affect the flawed forecaster.
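A quick sketch of that arithmetic, following the comment’s “round up” framing and assuming the Martians stay home (outcome = 0) and Brier scoring, where lower is better:

```python
import math

def brier(p, outcome=0):
    return (p - outcome) ** 2

def round_up(p, step):
    # push the forecast up to the next multiple of `step`, per the example above
    return math.ceil(p / step) * step

for name, p in [("super", 0.21), ("slightly better", 0.25), ("regular", 0.30)]:
    print(name,
          round(brier(p), 4),
          round(brier(round_up(p, 0.05)), 4),
          round(brier(round_up(p, 0.10)), 4))
# super           0.0441 -> 0.0625 -> 0.09   (hurt the most by coarsening)
# slightly better 0.0625 -> 0.0625 -> 0.09
# regular         0.09   -> 0.09   -> 0.09   (unchanged)
```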
“Superforecasters did better, placing higher than about 80% of the population.”
I find this IQ estimate implausibly low if “the population” is referring to “the entire American population.” I would guesstimate the consistent superforecasters are >95%, maybe 98th percentile on average.
You have to go through an online audition just to be invited to participate in the Good Judgment Project at all. You list your advanced degrees and take a test. Tetlock and company went out of their way to publicize their contest on high end blogs and publications. For example:
“Philip Tetlock requests your help
“by Tyler Cowen on August 3, 2011 at 5:28 pm in Economics, Political Science | Permalink
“He is one of the most important social scientists working today, and he requests that I post this appeal: …”
http://marginalrevolution.com/marginalrevolution/2011/08/philip-tetlock-requests-your-help.html
I passed the audition and qualified to participate in the Good Judgment Project, but once I looked at the questions, I realized I wasn’t smart enough to do well without putting in a huge amount of work, so I didn’t participate further.
A lot of the questions are like: “Will Anton Boratsky still be Prime Minister of Slovakia at the end of the year?”
That requires reading up on Slovakian political history, constitutional structure, and current events, and creating a modest quantitative model of whether there will be an election this year, and, if so, will Boratsky win it. Also, what are Boratsky’s chances of not dying over the course of the year?
And then I have to stay current on Slovakian affairs to update my probability every week or so.
And multiply this kind of thing by scores of countries.
And then there are a large number of questions that have to be modeled in a different fashion.
So, you need to be able to learn a lot of obscure factual information quickly, apply it logically, and think quantitatively about it. That’s pretty close to what intelligence tests are good at measuring.
Tetlock’s finding of a few score superforecasters who could and would beat CIA analysts largely for the fun of it is reminiscent of what Bill James showed with baseball statistics in the late 20th Century: there’s a lot of analytical talent out there, and most of it isn’t found among baseball lifers.
A community of sharp guys talking to each other about how to think better about subjects that have traditionally been considered the domain of professionals can make big strides.
I’m reminded of how Scott has put together quantitative analyses of psychiatric drugs that nobody has bothered to do before:
http://takimag.com/article/moneyball_for_medicine_anyone_steve_sailer/print
I’d like to see the sabermetrics attitude spread widely.
To synthesize my two comments, there are 15 million people in the top 5% of general intelligence in the US, 6 million in the top 2%. A fair number of them are kind of bored at work.
There’s a lot of data available on the Internet on a whole lot of problems. We’ve seen that a few tens of thousands of smart baseball fans kicking statistics around can make progress. So, it’s not surprising that Tetlock recruiting foreign affairs junkies and teaming up the most reasonable forecasters makes them even better at forecasting.
I’m interested in what other fields could be moneyballed? I’ve suggested before that Scott has the talent to be the Bill James of psychiatric drugs, and that might be able to do a lot for human happiness.
In more zero sum fields, urban gentrification would seem like a field where data junkies could team up profitably. I suspect the market for undervalued homes isn’t as efficient as the ones for stocks, so money could be made off building models for optimizing real estate investments.
I’m willing to make a guess on the Boratsky one: yes, he will 🙂
Based on nothing more than “I’ve read nothing and heard nothing about turmoil in Slovakia this year yet, so the situation there may be roiling the Slovakians but not enough to attract international attention from other EU members. So, unless stories about mass protests and calls for his head on a pike appear in the news, I’m going to say he’ll manage to survive”.
I mean, I’m estimating Enda Kenny will be returned to power as Taoiseach after our just-announced election, even though the prospect thrills me even less than my dental appointment to get work done in a fortnight’s time.
You saw good solid evidence that disagreed with your prior and your response was to not update at all? Then you posted about it in the belief that–what–your bare assertion would convince others you are right and the study is wrong?
Did you become disoriented while wandering the internet and end up here by accident?
I think it’s fair to hold off updating on a piece of evidence until you’re sure you understand that piece of evidence.
Sailer is not defying the data here, he’s just wondering if “the population” might mean something like “the population of college students”. The quote doesn’t really give enough context to say for sure.
Given that (if I understand the original experiment correctly) the participants were drawn from the ranks of college students, then yes we’re not talking about the general population so Sailer is entitled to provide his own estimate.
I don’t think superforecasters were in any way drawn from the ranks of college students. What makes you say that?
Right, they were drawn from college graduates, not mere students, at least according to the initial announcement.
Okay, got that wrong, apologies.
“You saw good solid evidence that disagreed with your prior and your response was to not update at all?”
Right. I thought it was more likely that the undocumented assertion about superforecasters being at only the 80th percentile in intelligence was wrong than that everything else I know would be called into doubt.
And it turns out my skepticism about the assertion made in Tetlock’s mass market book was valid.
They came up with an average IQ of 115 for the Superforecasters somehow.
It looks like this paper has Raven’s APM-SF scores for superforecasters, top-team individuals, and all others (table 2, page 274). I can’t easily find a table linking APM-SF scores to IQs (and I would expect the transform of the average to be different from the average of the transform) but I don’t see too much reason to doubt the estimate of 115. (One might say Raven’s is a bad subtest for this sort of thing, but they also do some abstraction and vocab tests.)
You may not have seen it, but Keith Stanovich has done a lot of research on “RQ,” the rationality quotient, and found it’s only moderately correlated with IQ. It measures the sort of intellectual humility that makes one good at being a forecaster (because one’s beliefs are agile), where IQ seems much more closely related to the sort of intellectual ability that makes one good at being a lawyer (where beliefs are fixed and arguments are agile).
No, that paper does not say 115. It says at least one standard deviation.
That table does show that the superforecasters were barely smarter than the general forecasters, but the pool was pretty smart. According to this a Shipley-Hartford vocab score of 38 corresponds to an IQ of 131. The whole pool had a (verbal) IQ of 127, and the superforecasters were 130.
But it looks like everyone did worse on the abstraction subtest. I’m not sure how much worse, though, because Shipley-2 switched to a 25 point scale from a 40 point scale, so I don’t have norms. And it’s possible that my numbers are wrong because I shouldn’t be applying the Shipley-Hartford norms to the Shipley-2 test.
“The whole pool had a (verbal) IQ of 127, and the superforecasters were 130.”
That sounds much more plausible. 130 is two standard deviations up, or a little under 98th percentile. Perhaps more informatively, Tetlock’s system of recruiting (e.g., by asking Tyler Cowen to promote his GJP on Tyler’s Marginal Revolution blog) and his admissions audition meant that the entire pool of participants averaged a verbal IQ of 127, which is Ivy League level.
Now, it could be that people who are good at holding intelligent opinions about world affairs are not as good at or don’t much like the nonverbal Ravens IQ test puzzles. I don’t know.
But having looked at the project’s questions, which I found daunting, I would think that most people who are consistent superforecasters would score very well on Wechsler IQ subtests such as information, vocabulary, arithmetic, categories, and perhaps some of the more nonverbal logic subsections.
My impression is that superforecasters tend to be the kind of people who find Tom Friedman lowbrow — thus, Tetlock’s book begins with a long section making fun of Friedman. And yet Friedman is well above the 80th percentile in intelligence.
I haven’t read Tetlock’s book and don’t know much about his project, but I’m mystified by these claims in the book that the mean IQ of the superforecasters wasn’t high or that IQ was a poor predictor of prediction accuracy. This is because Tetlock has published a paper where exactly the opposite is claimed about what I understand is the very same project. Tetlock et al. wrote:
This would suggest that the average IQ of the forecasters was maybe 120-130, probably close to 130.
If you look at the predictors of accuracy in Table 3 in the same paper, you’ll note that IQ has the highest total effect, a correlation of -0.54 (smaller Brier scores are better, so the sign is negative). This would suggest that the top two percent aka the Superforecasters probably have an average IQ north of 140, assuming a normal distribution.
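A rough back-of-the-envelope version of that inference (all three inputs are assumptions for illustration: a pool mean of about 127, an SD of 15, and the −0.54 correlation holding linearly within the pool):

```python
from scipy.stats import norm

pool_mean, pool_sd, r = 127.0, 15.0, 0.54   # assumed, not taken from the paper
top_frac = 0.02                             # "superforecasters" = top 2% by accuracy

z_cut = norm.ppf(1 - top_frac)              # accuracy cutoff in SD units
z_tail_mean = norm.pdf(z_cut) / top_frac    # E[z | z > cutoff] for a standard normal

print(round(pool_mean + r * pool_sd * z_tail_mean, 1))  # ~146-147 under these assumptions
```

Range restriction in the pool would shrink both the SD and the correlation, so the exact number should be taken loosely; the point is just that a 0.5-ish correlation plus a top-2% cut pushes the expected IQ well above the pool mean, consistent with “north of 140.”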
The Shipley vocab test has 40 items, and the superforecasters got 37.5 right on average, the other forecasters ~36.8. It seems obvious that many people hit the ceiling of the test, so the results probably underestimate both the average IQs of the two groups and the gap between them.
On the short-form Raven the forecasters apparently scored about 0.58 SDs above a random sample of University of Toronto students, if the SD from this study is used. Given that the University of Toronto appears to be pretty selective, this is consistent with the forecasters having a mean IQ somewhere near 130.
“Superforecasting” is a book aimed at frequent fliers, like Freakonomics or The Tipping Point, not at an academic audience. If it became a bestseller, the median buyer would likely have an IQ around the 80th percentile.
It’s in the interests of the co-authors and publisher to downplay just how intellectually difficult the foreign affairs forecasting questions were. Literally, the whole world outside the United States was the subject matter. Some of the questions were about subjects I, a full-time professional pundit, had never heard of before, such as the Spratly Islands dispute.
The book blurs somewhat just how recondite the subject matter of the GJP was. That’s not unreasonable. Tetlock has lots of academic publications that give details. This general audience book was hoping to lure in readers who will never ever have enough room in their brains to care about the Spratly Islands on top of everything else going on in their lives; but they might find some of the tips helpful on the job or playing fantasy football or whatever.
It’s quite possible for somebody with a 115 IQ to read “Superforecasting” and learn valuable techniques that will make them better at forecasting on subjects they actually care about.
Here’s an example: I could imagine Peyton Manning (Wonderlic test IQ ~ 118) reading “Superforecasting” and giving it to his business manager to read, and the two of them profitably using methodologies in the book to help them choose which fast food chains to buy franchises in. However, I can’t imagine Peyton Manning becoming fascinated enough by the Spratly Islands and the like to become a foreign affairs superforecaster in a future Good Judgment Project.
All four presidential elections in this century have been relatively predictable on a state by state basis by a single demographic statistic: the average number of years a white woman between 18 and 45 is married. E.g., Utah has by far the highest years married stat, and it’s always highly Republican. Massachusetts is at the opposite end and is always highly Democratic.
However, that correlation isn’t all that useful at predicting who will win the next election. States in the middle on years married like Ohio can tip Republican or Democratic without disturbing the remarkably consistent rank ordering.
I was going to say, on this, that the “actually that’s not so impressive since only two states flipped” claim is pretty poor. It’s obviously far from guaranteed that 96% of states are going to vote the same way as the last election. That factoid is just restating what we all already knew – Obama won both his elections by pretty similar margins.
If you were to use the no change methodology in 1980 rather than 2016 you’d end up looking pretty dumb. And while we’re unlikely to see a Reagan style wipe this year, a Republican victory is certainly possible and if it happened it would involve a few more than two states flipping.
But that doesn’t necessarily mean the rank ordering would change. You could conceivably see a 538-electoral-vote blowout without the rank ordering changing – Utah going 80% Republican and Massachusetts going 50.1% Republican.
In fact, 538 has done analysis of this kind (based on polls not marriage stats, of course); they called it “tipping point” analysis under the premise that if a candidate won state X he’d almost certainly have won state Y more decisively.
Yep, I’m not arguing with Sailer’s point, just the claim Scott referenced in his post. Identifying where the tipping point will be ahead of time is not trivial. A “no change” model is better than choosing at random but usually won’t do as well as it did in 2012.
I find the granularity result very intuitive.
What if I framed it like this: “The forecasters followed a process that results in a probability figure. We then adjusted this figure to whichever number was nearest that happens to look pretty to humans in an arbitrarily chosen base (10, in this study). Surprisingly, this adjustment reduced the accuracy of the assessed judgements. Future research should study whether pretty-looking binary or hexadecimal numbers get better results. “
I agree with Dan’s take. As a trivial example, when I was in school and we were learning about estimation, we had to estimate the length of a desk. I estimated it at one and a quarter meters. The teacher asked for decimals, so I said 1.25 m. That was “too precise”, so it got rounded to 1.2 meters. It turned out to be 1.24 meters.
If the superforecasters were calculating probabilities using fractions, you can easily end up with a number which looks very granular in base 10, but may be a lot less granular in a different base.
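As a toy illustration of the base-dependence point (my example, not the commenter’s): a forecaster working in sixteenths produces probabilities that look fussy in decimal but are perfectly coarse in binary.

```python
from fractions import Fraction

p = Fraction(11, 16)         # a forecast expressed in sixteenths
print(float(p))              # 0.6875 -- four decimal digits, looks over-granular
print(bin(p.numerator), "/", bin(p.denominator))  # 0b1011 / 0b10000, i.e. 0.1011 in binary
```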
Additional note about Normandy – there was a massive and very interesting disinformation campaign that delayed the German response by weeks – see https://en.wikipedia.org/wiki/Operation_Bodyguard
For example, the Double-Cross System – “After the war it was discovered that all the agents Germany sent to Britain had given themselves up or had been captured, with the possible exception of one who committed suicide.” + “Later agents were instructed to contact agents in place who, unknown to the Abwehr, were already controlled by the British.” – the entire German spy network in Britain worked for the UK! This resulted in a very interesting optimization problem – after all, a captured spy network should not be reporting total nonsense, but giving useful information to the enemy during a war… See for example https://en.wikipedia.org/wiki/Double-Cross_System#V-weapons_deception
The remarkable failures of Nazi intelligence during WWII suggest that perhaps highly intelligent Germans stayed out of Nazi intelligence.
Or perhaps what they called ‘eugenics’ at the time led to a different loss of intelligent Germans.
Note that many losses of spies in Britain (and the USA) were defections. As in – smart people used Nazi intelligence to escape and help its enemies.
It’s kind of impressive how awesome the British Intelligence Services were during WW2 given how utterly useless they were in the Cold War.
The British Intelligence Services were consistently awesome at performing their mission, which during WWII and the Cold War both was, “Deliver valuable intelligence to Moscow and its allies, and confound the enemies of the Soviet Union through misinformation, espionage, and counterespionage”.
Um, is this some kind of joke? Can you elaborate? I’m aware they had some high profile moles like Kim Philby (and the CIA had Aldrich Ames, and KGB had Penkovsky, but the British probably had more). Is that what you meant?
Essentially yes. Philby was a mole of a higher order than any of the rest; being head of counterintelligence for MI6 basically meant that all of Moscow’s other agents could operate with impunity in the early Cold War era. I don’t think anyone, even in Moscow, knows how thorough the infiltration was at this point.
But note that this really applies only through the mid-50s, after which MI6 seems to have been effectively de-moled and upgraded in competence. Hmm, right about the time James Bond showed up…
The British intelligence services worked very well in the Cold War – all the Oxbridge communists were highly effective at collecting information on British government and military preparations and passing it to the Soviets – they just weren’t on Britain’s side. Not a problem in WWII as very few Oxbridge graduates were sympathetic to National Socialism.
I think this is pretty much the consensus, although there have been too few public discussions about it for obvious reasons. An actual WMD program costs money and could attract raids and possibly even an invasion. On the other hand, a possible WMD program allowed Saddam to preserve the regime’s dignity in the face of otherwise humiliating involuntary UN inspections (basically part of a surrender agreement, remember) and served as a deterrent to regional rivals if not the United States. Many senior Iraqi officials, civilian and military, actually believed their government had a chemical/biological program, if not a nuclear program, and this belief was detected by American intelligence. The manufacture of “standard” (i.e., no extensive testing of novel agents or delivery systems) chemical weapons is also actually pretty hard to distinguish from legitimate industrial chemical production. Plus, there were probably still some old chemical munitions sitting around.
To be less charitable, there were definitely influential people in the American government–Paul Wolfowitz probably being the most notorious–who saw the “Iraqi WMD program” as a convenient pretext to start the grand neoconservative reconstruction of the Middle East and might have even [bad faith assumption alert] been willing to regard Iraqi WMDs as a noble lie even if they had good evidence against their existence.
I thought at the time that the only reason we got so much noise made about the WMDs was that:
a) The only grounds that could possibly get a majority of the UNSC members at that time to support it was WMDs (despite the fact that any resolution would surely be vetoed by Russia and China), and
b) Tony Blair, for domestic and particularly intra-party reasons, needed such a majority, equivalent to the UNSC support for the NATO bombings of Yugoslavia over Kosovo.
Similarly, we heard about democracy mostly because if you’re trying to sell the US public on a war, for the last hundred years you’ve sold it as a war for democracy (and never mind fighting on the side of the Tsar, or Stalin, or local idiots in Korea and Vietnam, or the contortion where “well, the Kuwaiti royals agreed they’ll be more democratic if we get their country back for them”).
The actual reason for the war was, I thought, a simple enough matter of realpolitik.
1) The previous twelve years of escalating Al Qaeda attacks, culminating on September 11, 2001, had thoroughly proved that US troops on the soil of the Arabian Peninsula were a provocation to militants that resulted in thousands of American civilians killed. Therefore the US had to withdraw forces from the Arabian Peninsula.
2) Though weakened by its loss in 1991 and subsequent sanctions, the Iraqi regime still had enough military power to roll over the Arabian Peninsula against only local opposition, which would have put Saddam Hussein in charge of half the world’s oil reserves (and the better half at that). Against such economic power no system of sanctions could survive, and Hussein would then be in a position of greatly enhanced power (one from which he could, for example, freely pursue his old nuclear ambitions on a greater scale than he could back in the 1980s) after having hated the US for a decade.
3) Therefore, it was necessary for US interests to destroy the Hussein regime and Iraqi army as part of the process of US forces leaving the Arabian Peninsula. As a result either Iraq would have a US-chosen central government that wouldn’t particularly mind turning around and knocking off the medieval Saudis, or Iraq would fracture in internal struggles such that it was no longer a threat to the Arabian Peninsula; both possibilities would serve better than the status quo.
So, based on this, I thought that the actual mission in the minds of the people who made the decision to start the process of going to war, was indeed accomplished by the carrier landing/speech on May 1, 2003. But none of the things the administration had talked about (finding WMDs, establishing democracy) in order to sell the war were, so it couldn’t, as a political matter, just pull everybody out as quickly as possible.
Fascinating, except what would be the point in Saddam claiming all that economic power, without the military power to back it up? Wouldn’t that just lead to, well, what happened?
The trouble with a war after Iraq rolls over the Arabian Peninsula is geography. In the Gulf War, the troop buildup was in Saudi Arabia. In the Iraq War, the troop buildup was in Kuwait. Assuming Iraq rolled over the whole Arabian Peninsula, where would the US/UK build up their forces for the invasion?
Iran and Syria hardly were going to allow US troop buildups in their countries, whatever their desire to see Hussein knocked down. Jordan opposed the Gulf War; they might be able to be convinced to be used as a staging area, but it’s hardly certain. Similarly, Turkey neither participated in nor allowed ground forces to operate from its territory in either the Gulf War or Iraq War, though it allowed air units in the Gulf War. That would leave… a difficult amphibious operation across the Red Sea from Djibouti or something, or from Diego Garcia direct across the Indian Ocean.
Anyway. It would be uncertain there would be any ally from which to launch a ground invasion, amphibious operations are just plain difficult, and a 1991-style destruction of the Kuwait oilfields extended to the oilfields of the whole peninsula would be a major shock to the world economy. Which might be enough to convince Hussein he can pull it off, even if the US leaders privately know they would 100% do whatever it takes to dislodge him in such a case.
So, assume you estimate, say, a 20% chance of Hussein taking the gamble that it would be too difficult to dislodge him after he seized the peninsula, with all sorts of bad consequences even if you do remove him in response. Isn’t it tempting to reduce that to 0% with a short war with few allied deaths?
If I’m right, the mistake that the Bush Administration made was not military, but political. Rumsfeld, the SecDef who was a former SecDef; Cheney, the VP who was a former SecDef; Rice, the National Security Advisor who was an academic expert in foreign policy; and Powell, the SecState who was a general — they put together a perfectly competent military plan to solve two major foreign policy problems. Which worked; in five weeks and with 172 Coalition military fatalities, Hussein was ended as a threat and the US could get its boots off the holy soil of Arabia permanently.
But to launch their short, effective little war, they needed to sell it to the public. And the way they sold it, the lies that were told (thought of as just exaggerations; Hussein was a psychopath, WMD seemed to be real, and we’d certainly make the postwar government nominally democratic for a while) dragged the US into a long, expensive quagmire.
Thanks for convincing me (via the yearly prediction post) to write down my predictions along with how sure I am about them.
Due to privacy issues, and the fact that I am predicting things in my personal life rather than stuff like “Will Turkey impose capital controls before 1 December 2016?”, I keep it in a local Calc file rather than on PredictionBook. It has turned out to be quite useful – it not only proved that I am overconfident, and more overconfident than I expected (AFAIK a typical result, and it still surprised me), but also showed that many of my “it went exactly like I expected” impressions were wrong.
Also, it has already influenced a major decision of mine by eliminating a ridiculous gap between how I acted and what I believed (there was a low-cost intervention that reduced the risk of a high-cost event I believed to be possible, but I did nothing to prevent it until writing the prediction entry and noticing the discrepancy).
You’re already anonymous with a fake email. Can you be specific about your personal-life predictions and how they motivated you to change your behavior? Unless it’s something really rare that might make you personally identifiable despite anonymity.
OK, this one conveniently is not private.
I suspected that I had some medical problem and did nothing about it – then I created an entry like “there is a 20% chance that the pain in my leg is not hypochondria”.
It motivated me to make the call and go to a doctor the next week – after all, even a low chance of a health problem warrants a doctor visit (especially one that is free).
It turned out that it was not hypochondria, and that reducing the risk of serious problems is doable.
In the end I traded 20 minutes of travel for a reduced risk of a potentially serious medical problem.
Saddam, before he was ousted, was doing a delicate balancing act. He had to convince his supporters and regional rivals that he had nuclear weapons so that they would support or fear him, respectively. And he had to convince the western powers that he didn’t so they wouldn’t depose him. Saddam had always been super paranoid about the CIA so he just sort of assumed that they would naturally figure it out. But really the CIA had access to defectors from Saddam’s followers who all said that Saddam had told them that he had nuclear weapons and was successfully hiding them from the inspectors. And that’s who the CIA believed.
A long time ago Kissinger wrote a letter to someone taking over for him about the dangers of having access to classified information. You might hear from some outside academic who was very smart and had good information on a topic, but you’d be tempted to just discount anything they said because they didn’t have what seemed like key facts that you did. So while objectively the UN inspection teams were providing better information than the defectors, the latter information was TOP SECRET and counted for more with the CIA.
Even when the US was preparing to invade, Saddam continued to think that the US knew the truth and just wanted to depose him. But if that had been the case you wouldn’t have seen, e.g., Bill Clinton go to bat for Bush on the WMD topic. There’s a good chance that Bill Clinton himself didn’t have anything to do with the mysterious missing ‘w’ keys from all the White House keyboards when the transfer happened, but I think it’s safe to say that he wouldn’t have lied for Bush about the topic. Before he spoke up I’d been very skeptical about Bush’s assertions.
Here is the story about the danger of classified knowledge. (It’s advice from Ellsberg to Kissinger, FWIW.)
Thank you, that’s exactly the story I was remembering even if I misremembered that part.
That sounds like the exact opposite of the truth. What defectors? What the frak are you talking about?
But if that had been the case you wouldn’t have seen, e.g., Bill Clinton go to bat for Bush on the WMD topic.
Don’t know what you mean here either. But one might certainly think his remarks in 1998 meant that Kamel’s testimony was not the end of the story, if one had forgotten that Bill while in office had political reasons for not wanting the WMD-related sanctions against Iraq lifted. One might also naively predict that if Colin Powell repeatedly contradicted his intelligence on Iraq, and then blamed his mistakes on the intelligence community, they would call him on it.
Bill Clinton on Larry King, 6 February 2003.
I love this. Almost makes it a shame that I never intend to get married. That must make for some interesting conversations.
Prediction: Family members who disapprove of a given marriage will be less hostile in the long term, after being asked for their opinion in this way.
What’s the divorce rate like among the couples known to have done this? I assume the sample size is too small to draw useful conclusions but it would still be interesting to know.
I don’t know if it’s a good way to have a successful marriage, but it sounds like a great way to end some friendships.
“So Joe thinks I’ll never make it work because I’m an uptight control freak and won’t be able to resist trying to micromanage my spouse’s every movement? Well, guess who’s not getting invited to the wedding reception, Joe!”
I would guess that the set of people who are likely to ask that question and the set of people who are likely to react badly to getting unflattering answers don’t overlap much.
This more or less happened to me, lol, back before I realized “telling the truth” is a horrible way to interact with people. Explaining why you responded the way you did, incidentally, makes it much worse.
> (and that sentence would also have worked without the apostrophe or anything after it).
Dude, come on.
On the note of modern medical research I’d like to again signal boost the COMPare project.
http://compare-trials.org/
Ben Goldacre (author of the book Bad Science) and a team (http://compare-trials.org/team) recently set up a website for tracking improperly switched/unreported/added outcomes in clinical trials.
I believe they’re doing a highly valuable piece of meta-science with the project and the reactions of some (supposedly respectable) journals have been surprisingly poor while others have published corrections.
So far, of 67 trials checked, 9 were perfect; in the remaining 58 trials, a total of 301 prespecified outcomes were silently not reported and 357 outcomes were silently added.
58 letters have been sent to journals about switched outcomes of which only 6 have been published.
It is a major problem in modern clinical trial publishing since outcome switching destroys statistical validity.
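For anyone who doesn’t want to wade through the tedium, here is a toy simulation (my own, with made-up numbers – nothing from the COMPare data) of why outcome switching wrecks the statistics: if a trial of an ineffective drug measures many outcomes and reports whichever one happens to clear p < 0.05, far more than 5% of such trials will claim a “significant” result.

```python
# Hypothetical illustration: a "null" trial measures many outcomes and
# reports only the best-looking one. How often does at least one outcome
# reach p < 0.05 purely by chance?
import math
import random
import statistics

def two_sample_p(a, b):
    """Crude two-sample z-test p-value (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # two-sided

random.seed(0)
n_trials, n_outcomes, n_patients = 2000, 10, 50
false_positives = 0
for _ in range(n_trials):
    pvals = []
    for _ in range(n_outcomes):
        # The drug has no effect on any outcome.
        treated = [random.gauss(0, 1) for _ in range(n_patients)]
        control = [random.gauss(0, 1) for _ in range(n_patients)]
        pvals.append(two_sample_p(treated, control))
    if min(pvals) < 0.05:          # report whichever outcome "worked"
        false_positives += 1

print(f"False-positive rate with outcome switching: {false_positives / n_trials:.0%}")
# Roughly 1 - 0.95**10, i.e. about 40%, versus the nominal 5%.
```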
Part of the challenge of getting public support for fixing the problem is that in order to understand why it’s such a big problem people have to wade through a great deal of tedium, though I have a feeling that the crowd here won’t have much difficulty.
This is a great initiative. Fingers crossed it will bring about some change. Thanks for sharing!
So I took part in the original Good Judgement Project, and I have some doubts about whether it really proved anything. They did an open sign up over the internet, and promised some token amount (I think it was $50) to anyone who stuck through it, regardless of your accuracy. I was unemployed at the time, so $50 for answering a few questions on the internet sounded good.
What I didn’t realize was how *many* questions there were, and how boring they would be. You had to answer something like 5 per week, over a period of several months. And they were not interesting questions that you’d have an opinion on… most of them were very obscure, typically involving elections in some small country that ordinary people would never hear about. It was way too much effort to actually research all these topics and come up with a solid prediction, so after a while I started just blindly guessing. I can’t imagine I was the only person to do that.
My guess is that the “super forecasters” were mostly just people that got hooked on the “game” for whatever reason and put a lot of effort into it, while most other people were just going through the motions and making blind guesses to get the money. IIRC the second season of the game had even *more* questions required, to the point where I couldn’t even get through it.
>The Wehrmacht served a Nazi regime that *rpeached* total obedience to the dictates of the Fuhrer, and everyone *emembers* the old newsreels of German soldiers marching in goose-stepping unison…
“preached”, “remembers”.
Uniersity → University; intellgience → intelligence; signficant → significant; betteri n → better in; oreintation → orientation; controve → contrive; rrom → room; circustancse → circumstances; re[prted → reported; retunred → returned; rpeached → preached; emembers → remembers; witht → with
So how has your opinion as expressed there changed (in broad terms)?
The fact that “extremizing” seems to work so well might simply be a corollary of Aumann’s Agreement Theorem. Since the “superforecasters” seem to be more than typically rational, they may be decently modeled as rational actors, and the people collating their predictions into a single prediction should therefore be able to assume that if the entire group held a discussion about each question, they’d generally tend to reach an agreement on their prediction, which would probably be more extreme, in the same direction, than the “average” prediction prior to the discussion.
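For concreteness, here is the shape of one standard extremizing transform from the forecast-aggregation literature (scale the pooled forecast’s log-odds by a constant greater than one); the exponent below is an arbitrary illustrative value, not the one GJP actually fit.

```python
def extremize(p, a=2.5):
    """Push an aggregated probability away from 0.5 by scaling its log-odds by a.

    p : pooled forecast, strictly between 0 and 1
    a : extremizing exponent; a > 1 sharpens, a = 1 leaves p unchanged
        (2.5 is an arbitrary illustrative value)
    """
    return p ** a / (p ** a + (1 - p) ** a)

# If many partially informed forecasters independently lean the same way,
# a plain average understates their combined evidence, so the aggregator
# pushes the pooled number toward the nearer extreme:
print(extremize(0.70))   # ~0.89
print(extremize(0.50))   # 0.50 – a neutral pooled forecast stays neutral
```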
“In the late 1940s, the Communist government of Yugoslavia broke from the Soviet Union, raising fears that the Soviets would invade.”
This section is from the CIA’s classic Words of Estimative Probability.
“He concludes that they’re effective when low-level members are given leeway both to pursue their own tasks as best they see fit,”
The US military uses the Five Paragraph Order format, probably the most important piece of which is the Commander’s Intent, which is the guiding factor for operations. When everything else goes to shit, follow the CI and you’ll be OK. It’s like Nelson’s orders at Trafalgar: “No captain can do very wrong if he places his ship alongside that of the enemy.”
” The information Bowden provides is sketchy but it appears that the media [sic?] estimate of the CIA officers – the “wisdom of the crowd” – was around 70%. And yet Obama declares the reality to be “fifty-fifty.” “
Apparently we need to dispel the myth that Obama knows what he is doing.
That’s way too harsh. What Obama said makes a fair amount of sense in context:
“What you ended up with, as the president was finding, and as he would later explain to me, was not more certainty but more confusion…in this situation, what you started to get was probabilities that disguised uncertainty, as opposed to actually providing you with useful information…”
This is a good point; when someone says “70%” it matters very much what the error bars on that estimate are, so to speak. If the meta-uncertainty is high enough, it might as well be 50-50.
And anyhow, even if I’m wrong, it’s downright silly to blow one mistake up into a “myth that Obama knows what he’s doing.”
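To make the “error bars on 70%” idea concrete, here is a toy sketch (my own illustration, nothing from the book): model an analyst’s belief as a Beta distribution over the unknown true probability, and note that two beliefs can share the same 70% point estimate while resting on very different amounts of evidence.

```python
def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) belief over a probability."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, var ** 0.5

# Both belief states below report "70%"...
print(beta_mean_sd(70, 30))    # (0.70, ~0.05) – backed by a lot of evidence
print(beta_mean_sd(2.1, 0.9))  # (0.70, ~0.23) – barely more than a hunch;
                               # with error bars that wide, "70%" is close to
                               # "fifty-fifty plus a lean"
```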
I believe kaninchen is referring to Marco Rubio’s poor performance in the last Republican debate, in which he stated four times with nearly identical wording that we must “dispel with” (sic, first two times) the “myth” that Obama *doesn’t* know what he’s doing.
More importantly, this is about an extemporaneous remark Obama made before sending people to Abbottabad to kill bin Laden.
One thing nobody’s mentioned so far is that people consider the consequences of each decision alongside the probability. If a doctor said you had a “fair chance” of having (treatable) cancer and wanted to perform exploratory surgery, would knowing that “fair chance” meant 10% versus 50% make a difference to your decision?
If the consequences of being wrong in either direction are the same, then accurate probabilities are what count. But if the consequences are lopsided, they matter less. This is often brought up regarding existential risk (AI, AGW), for good reason. Psychologically I think we’re wired to think this way whether or not it’s appropriate, and the common language used to express probability reflects it too, I think.
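A back-of-the-envelope version of the exploratory-surgery example, with completely invented costs, just to show when the extra precision does or doesn’t change the decision:

```python
# Completely invented costs, purely to illustrate lopsided consequences.
COST_SURGERY       = 1    # discomfort, risk, recovery time
COST_MISSED_CANCER = 100  # consequence of leaving a treatable cancer alone

def expected_cost_of_waiting(p_cancer):
    return p_cancer * COST_MISSED_CANCER

for p in (0.10, 0.50):
    print(f"P(cancer) = {p:.0%}: waiting costs {expected_cost_of_waiting(p):.0f} "
          f"in expectation vs. {COST_SURGERY} for surgery")

# With costs this lopsided, both 10% and 50% point the same way, so the extra
# precision doesn't change the decision; if the costs were closer to symmetric,
# the difference between 10% and 50% would matter a lot.
```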
“Tetlock mentions that even independent-to-hostile investigators concluded that [the US intelligence community] had been correct in using the facts it had to believe Saddam had WMDs.”
I’m having a hard time reconciling that claim with something else Tetlock writes:
“Postmortems even revealed that the IC had never seriously explored the *idea* that it could be wrong. ‘There were no red teams to attack the prevailing views, no analyses from devil’s advocates, no papers that provided competing possibilities,’ Jervis wrote… As the presidential investigation of the debacle tartly noted, ‘failing to conclude that Saddam had ended his banned weapons program is one thing — not even considering it as a possibility is another.'”
This sounds to me very much like what you would expect if the conclusion was chosen first, then evidence sought to support that conclusion, as Karen Kwiatkowski has charged. (Kwiatkowski was an Air Force officer who served in the Pentagon’s Near East and South Asia (NESA) unit in the year before the invasion of Iraq.)
Jervis’s claim is that the CIA followed its procedures without political influence and sincerely reached its conclusions, but that the procedures were lousy.
Then Tetlock is misrepresenting Jervis’s report.
You might want to check out Metaculus.com, if you have not already. It’s taken a lot of cues from the Good Judgement Project research, but focuses more on scientific and technical questions that would probably be of more interest to this community. It’s very much a work in progress, and any suggestions (other than ‘but it doesn’t have enough color’) are welcome and of interest.
I feel like I must be missing something regarding the rounding/granularity – why is it surprising?
Rounding just adds noise. If you take any process with predictive power and add noise, you end up with less predictive power in the result.
I’m assuming Tetlock’s analysis must be deeper than this but, unless I missed it, there’s no description of any more sophisticated statistics that would lead me to be surprised by this result.
Both superforecasters and ordinary forecasters have signal, so your argument applies to both of them. Mellers’s rounding reduced the Brier score of superforecasters but not of regular forecasters. Your argument failed to predict the observation, so you should be surprised.
No. Better forecasts will be more strongly affected by noise than worse forecasts. And:
superforecasters lost accuracy in response to even the smallest-scale rounding, to the nearest 0.05, whereas regular forecasters lost little
but “little” is not “nothing”; so regular forecasters are affected by rounding but less so than superforecasters which is what I would have predicted.
So I continue to think that we need to see some math.
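In the spirit of “seeing some math”, here is a crude simulation sketch (my own toy model, with made-up parameters) that anyone can tweak: generate a sharp forecaster and a hedged forecaster, round both to the nearest 0.05 or 0.1, and compare Brier scores.

```python
import random

def brier(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def simulate(sharpness, n=100_000, seed=0):
    """Toy forecaster: sees the true probability plus a little noise, then shrinks
    toward 0.5 by (1 - sharpness); sharpness=1 means using the full signal."""
    rng = random.Random(seed)
    forecasts, outcomes = [], []
    for _ in range(n):
        p_true = rng.random()
        signal = min(max(p_true + rng.gauss(0, 0.05), 0.0), 1.0)
        forecasts.append(0.5 + sharpness * (signal - 0.5))
        outcomes.append(1 if rng.random() < p_true else 0)
    return forecasts, outcomes

def rounded(forecasts, step):
    return [round(f / step) * step for f in forecasts]

for label, sharpness in (("sharp", 0.95), ("hedged", 0.50)):
    f, o = simulate(sharpness)
    print(label,
          round(brier(f, o), 4),
          round(brier(rounded(f, 0.05), o), 4),
          round(brier(rounded(f, 0.10), o), 4))

# In this toy model rounding adds roughly step**2 / 12 to both forecasters'
# Brier scores, i.e. about the same absolute penalty – so the bare "rounding
# adds noise" argument doesn't by itself predict the asymmetry Mellers found,
# which is exactly why more detailed math (or a richer model) is needed.
```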
“Bier score” should be Brier score
“In the old days, people tried possible new medications in a very free-form and fluid way that let everyone test their pet ideas quickly and keep the ones that worked; nowadays any potential innovations need $100 million 10-year multi-center trials which will only get funded in certain very specific situations.”
I think it’s fairer to say that this is a regulatory and liability problem. Scientific medicine worked then and works now, but the range of experiments bureaucracies will permit you to perform is much smaller now. The government has been progressively outlawing the actual practice of science at the same time, and largely with the same laws, with which it has been progressively mandating the exclusive use of science’s methodology.
Agreed; the insanity of the current regulatory regime is orthogonal to the amount of evidence available. Holding the distance between the fruit and the ground constant, if you combined current attitudes about measuring effects and looking at evidence with past legal climates surrounding experimentation, you should get no worse results, and perhaps better ones.
I am reminded of a great comment that Moldbug made on Hacker News:
It seems like anti-quack measures are only hindering legitimate research, because they are effective at stopping quacks. If they’re effective at stopping quacks, you’re not going to be seeing any quacks!
It’s like saying “nobody’s ever tried to rob that bank–so why do they need all those locked vaults and security guards?”
Furthermore, bringing up Steve Jobs in this context is odd because he had a rare form of pancreatic cancer that is often survivable if treated promptly, but he decided to wait nine months to see if alternative medicine would help him first. It’s not a smoking gun that instantly killed him, but quackery didn’t help and could easily be a contributing factor to his death.
How many people were being killed by quacks in the 19th century? More than now, I wager, but I’m sure it didn’t come close to the numbers being killed by Tuberculosis.
People generally have an interest in buying treatments that work and won’t kill them. The typical response is that anyone can sell hope to a dying man, but dying men tend to die anyway. Is it good to stop someone profiting unfairly from another’s misery, if the cost is retarding the progress of medical science?
And this is with the rather unfair assumption that no FDA means unlimited legal quackery. In fact, quacks who outright lie or materially mislead have always been guilty of fraud. The FDA is primarily about “protecting” people from making contracts that they genuinely, with full knowledge of what they are doing, think are a good idea, but which imply some risk for them. About banning your doctor from saying, “Look, this might not work, or might have terrible side effects, but it also might cure you. We can try it.”.
Re: granularity:
“This was the part nobody on the comments to the last post believed, and I have trouble believing it too.”
A lot of the questions have a ticking clock aspect to them.
Say the question is: Will Slobotsky still be prime minister of Lower Slobbovia on December 31 of this year? You check out Wikipedia and see that no general elections are required this year because he’s only been in office one year. You determine that historically Lower Slobbovian prime ministers have a 60% chance of making it through their second year. Of the 40% who don’t, half fell due to weakness apparent at the beginning of the year and half fell due to unanticipated problems that emerged during the year. On January 1, you see that Slobotsky seems to be in a strong position to maintain his grip on power throughout the year, so you guess he has an 80% chance of making it through the year.
If you are working hard, you will periodically update his chance. If he seems to still be in a strong position after 18 days into the year, you would upgrade his chances of making it to the end of the year from 80% to 81%. And you would repeat periodically.
Similarly, you’d make adjustments for news. For example, say you notice in late January that the New York Times has run an article that sounds, reading between the lines, that George Soros has taken a dislike to Slobotsky and may be funding anti-Slobotsky groups. So maybe you drop his chances from 81% to 79%. But then you read that Soros’s statements against Slobotsky have caused his approval rating in the polls to climb because Soros is unpopular in Lower Slobbovia due to the Slobbovian Meatball Corner of 1994, so maybe you change your prediction from 79% to 83%. And so forth and so on.
This is not to say that the chance of Slobotsky riding out the year is exactly 82%, just that you SWAGed it as 80% on January 1, and since then time has passed and you’ve learned new information, so you might as well nudge your estimate in the appropriate direction.
You get points in the GJP for every day, so it pays to adjust probabilities a few times per month.
Non-superforecasters tend either to underreact to the passage of time and to new information because they aren’t paying attention, or to overreact when they do notice. Superforecasters tend to stick to their baseline estimates and just nudge their forecasts in the direction of the new information.
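For what it’s worth, the Slobotsky arithmetic above falls out of a very simple model – spread the initial 20% chance of falling evenly over the year and condition on survival so far (my own simplification, not GJP’s method):

```python
def p_survives_year(days_elapsed, p_fall=0.20, days_in_year=365):
    """P(still PM on Dec 31 | still PM after days_elapsed quiet days), assuming the
    initial 20% chance of falling is spread evenly across the year."""
    p_already_fallen = p_fall * days_elapsed / days_in_year
    return (1 - p_fall) / (1 - p_already_fallen)

print(round(p_survives_year(0), 3))     # 0.8   – the January 1 estimate
print(round(p_survives_year(18), 3))    # 0.808 – the "nudge 80% to 81%" after 18 quiet days
print(round(p_survives_year(180), 3))   # 0.888 – half a quiet year later
```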
I don’t understand how the probabilities work in these forecasts.
You mention a couple of times that forecasters who used more specificity in their probabilities were better forecasters, but how do you know? If one forecaster says an event has a 20% chance of happening and a second says it has a 28% chance of happening, and then it doesn’t happen, how do you measure who was more correct?
By the same token, how do you measure the accuracy of any of these forecasters? It has to be more nuanced than “most of the things you said were more than 50% likely to happen happened, and most of the things you said were less than 50% likely to happen didn’t,” but every other analysis I can think of seems to need people to look backwards and say “looking back, this thing that happened had only a 15% chance of happening – who forecasted 15%? You win.”
I assume I’m one of the people who doesn’t understand probability. 🙂 But I’d like to.
Say that I’ve made a hundred predictions, each of which has given a (different) event a 30% probability. So I might have predicted a 30% chance for a nuclear war next year, a 30% chance for the sun rising next year, a 30% chance of an alien invasion in a month, and so on.
If I was well-calibrated, then 30 out of those 100 predictions should have come true.
Similarly, if I’ve made 84 predictions, each with a 50% probability, then I’ll be perfectly calibrated if 42 (50% of 84) of my predictions come true, and so on.
If one forecaster gives an event 28% odds and another gives it 20%, then you can’t really compare their accuracy on the basis of that event alone. But if one of them assigned several predictions a 28% probability, and 29% of those came true, then that’s better than the guy who assigned several predictions a 20% score and had 28% of them come true.
But we care about accuracy, not just calibration. Mainly people were judged on the basis of Brier score, not calibration. It is easy to become well-calibrated at the expense of Brier score, but that is usually a bad move. The better forecasters were both calibrated and accurate. Calibration was used for a number of purposes, such as the final adjustment.
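To make that concrete, here is a minimal sketch (toy data, not GJP’s) of the two measures being discussed – the Brier score and a calibration table:

```python
from collections import defaultdict

def brier(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes (lower is better)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def calibration_table(forecasts, outcomes):
    """How often events actually happened within each 10%-wide band of stated probability."""
    bands = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        bands[min(int(f * 10), 9) / 10].append(o)
    return {band: sum(os) / len(os) for band, os in sorted(bands.items())}

# Toy example: two forecasters on the same five events (1 = it happened).
outcomes     = [0, 0, 1, 0, 1]
forecaster_a = [0.20, 0.28, 0.90, 0.10, 0.75]  # sharper forecasts
forecaster_b = [0.40, 0.40, 0.60, 0.40, 0.60]  # everything hedged toward 50%

print(brier(forecaster_a, outcomes))   # ~0.04 – lower, i.e. better
print(brier(forecaster_b, outcomes))   # ~0.16
print(calibration_table(forecaster_a, outcomes))
```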
Regarding that Obama anecdote and Yudkowsky’s “Say It Loud”, where he says:
Well, I suddenly realized that I don’t understand how a good Bayesian is supposed to communicate their state of uncertainty.
For example, consider a situation: the CIA performed a bunch of updates starting with a 50% prior and got a 70% probability, but if they had started with a 10% prior they’d have gotten a 13% probability, and 93% for a 90% prior.
And another situation, where they had very good updates that would end them with 70%, 69%, and 71% probabilities respectively.
I think that that’s two very different situations, and that saying loudly that you’re 70% confident in something does not actually communicate your confidence.
I kinda really want to know, when you confidently say that the chance of the coin landing heads is 50%, whether that’s because you flipped it a thousand times and became reasonably sure that this particular coin is fair, or because this is the first time you’ve seen it and what you communicate to me is entirely your prior, with zero information about that particular coin.
And if there’s no accepted way of communicating that difference in the Bayesian framework (at least I’ve never seen anyone doing anything like that), then when you realize that someone’s 70% estimate is actually 71% their prior by mass, calling that a coinflip seems to be a step toward a greater understanding of the situation, while insisting that there’s nothing besides the 70% figure because your favorite tool can only produce a single scalar sounds pretty unwise!
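One illustrative way to separate “how much of this 70% is prior” (a sketch of my own, not a standard reporting convention, as the commenter says): report the likelihood ratio of your evidence alongside the prior, since in odds form the two just multiply.

```python
def posterior(prior_p, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior_p / (1 - prior_p)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Weak evidence (likelihood ratio ~2.3): the posterior mostly tracks the prior.
for prior in (0.10, 0.50, 0.90):
    print(prior, round(posterior(prior, 2.33), 2))
# 0.1 -> 0.21,  0.5 -> 0.70,  0.9 -> 0.95

# Strong evidence (likelihood ratio ~200): the prior barely matters.
for prior in (0.10, 0.50, 0.90):
    print(prior, round(posterior(prior, 200), 2))
# 0.1 -> 0.96,  0.5 -> ~1.0,  0.9 -> ~1.0
```

Stating the 2.33 separately from the 70% is one way to flag that your number is “mostly prior”, which is roughly the distinction the comment above is asking for.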
I was a researcher on one of the other ACE teams. I think GJP’s work is very important and useful, but I’ll summarize a few points that put the results in appropriate context (several of which have been noted by other commenters).
First, we don’t really know why superforecasters were better. In the first year or two of the tournament, GJP’s design was very tightly controlled and randomized. However, once the Super teams were formed and the other ACE teams dropped, the whole program shifted its focus from research to development. The Super teams were constructed with the highest performing individuals, but they also received different training than regular forecasters. They had special communication and news aggregation tools built for them, and (being incredibly smart people) built some tools for themselves. They had much more frequent and detailed communication with the researchers, and got to meet them in person. And, of course, they were labeled “super”. There were no control groups with randomly assigned high performing forecasters to investigate these factors. Tetlock would agree with all of this, I think.
A huge, HUGE part of GJP’s success came from motivation and effort. Superforecasters simply put in so much more effort than regular forecasters and other ACE teams’ forecasters – like several orders of magnitude more forecasts and more time spent reading up on questions. Before the training program was whittled down to the short 10-minute version, it was one or two hours long, plus another one or two hours of cognitive tests (IQ, personality, cognitive style, knowledge). There was a lot of attrition of people who signed up to participate, which is often neglected in the narratives describing the research.
If you read the published articles, you will find many behavioral correlations that attempt to describe why superforecasters were better, but fewer tightly controlled, randomized comparisons. That’s not Tetlock’s fault; I think it’s what the funders at IARPA wanted – see how accurate you can get, then try to explain why post hoc. Maybe that’s a reason they’re not quite as deserving of praise for the whole project. More patience and funds for more RCT-type designs would have given us more knowledge about why GJP was so good. Of course, there is a limit to any government agency’s patience, and the ACE program was still a huge success.
As others have pointed out, the granularity result is directly caused by the greater effort of the superforecasters. As in Tetlock’s comment about predicting 48/50 states, something like 80% of the questions resolved as status quo. Most world leaders stayed in power, most incumbents won reelection, most commodity prices rose or fell at about the same rate. So 24% doesn’t mean that a person thinks the probability is exactly 24%, but that it’s slightly lower than a day or two earlier when they said 25%, whereas someone who is less active might say 30% one day and 20% a week or two later, rather than hitting the incremental points in between. The scoring system – taking the squared difference between a forecast and the outcome, averaging over all days, and carrying forward old forecasts each day – means that updating frequently gives one a large advantage.
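A small sketch of the scoring mechanics described above (invented numbers): each question is scored every day it is open, with your most recent forecast carried forward, so small frequent nudges toward the eventual outcome add up.

```python
def daily_brier(forecasts_by_day, outcome):
    """Average squared error over all days; each day uses the latest forecast
    made on or before that day (carry-forward). None means "no new forecast today"."""
    current = None
    daily = []
    for f in forecasts_by_day:
        if f is not None:
            current = f
        daily.append((current - outcome) ** 2)
    return sum(daily) / len(daily)

days = 100
outcome = 0   # Slobotsky-style: the status quo held, the event didn't happen

# Set-and-forget forecaster: says 30% on day 1 and never touches it again.
lazy = [0.30] + [None] * (days - 1)

# Active forecaster: same start, but shaves a bit off every quiet day.
active = [max(0.30 - 0.002 * d, 0.05) for d in range(days)]

print(round(daily_brier(lazy, outcome), 4))    # 0.09
print(round(daily_brier(active, outcome), 4))  # ~0.044 – the frequent updater wins
```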
In my studies of economics, economic history and history of economics, I came across the story with Keynes’ rebuttal about changing his views.
It was apparently not about opinions based on facts but about a change of principles. Specifically, his attitude towards free trade – of which he was a principled proponent most of his life.
Having flexible principles is somewhat different than being open to experience.
This link now seems to be broken.
> If Cochrane had (truthfully) told them that the cardiology group was doing worse, they would have generated the meta-level principles “Cochrane’s experiment is flawed” and “if one group has a slight survival advantage that means nothing and it’s just a coincidence”. In some sense this is correct from a Bayesian point of view…
Yes, skepticism is an appropriate response to very surprising evidence – though not as much skepticism as they brought to the table, so his trick wasn’t an inappropriate thing to do. It was definitely a dark arts move, though, smacking them with an admittedly statistically insignificant result.
Your ideas are intriguing to me and I wish to subscribe to your poetry journal.
“I am…optimistic that smart, dedicated people can inoculate themselves to some degree against certain cognitive illusions. That may sound like a tempest in an academic teapot, but it has real-world implications. If I am right, organizations will have more to gain from recruiting and training talented people to resist their biases.”
Is this something different from what humans have been doing throughout the course of history? It kind of sounds like that’s the import, since it has a sort of brave-new-world ring. If we imagine it’s something different, how do we go about proving that “smart, dedicated people” of past eras were not “inoculat(ing) themselves to some degree against certain cognitive illusions”? Is it mainly because we view ourselves as having come up with a concept novel to the course of human history (i.e., cognitive illusions)? Without further careful investigation into the historical record, with attempts at identifying in past thinking a concept cognate to this modern cognitive-illusions one, I’d be reluctant to share this optimism. If humans have been more or less doing this sort of inoculation throughout history, on the other hand, then one of the book’s key premises looks a bit suspect.
For what it’s worth, I’d say Silver’s most impressive accomplishment isn’t predicting which party/candidate would win each state, but that 48 of 50 states had percentage share gaps (between winner and second place) within his 95% confidence interval by state. Literally could not have been better calibrated.
I participated in one of Tetlock’s forecasting teams, so maybe this can shed a little light on why they succeeded.
I really liked Tetlock’s book Expert Political Judgement and I’m in a field that frequently has to make similar types of predictions. So when Marginal Revolution mentioned there was this contest I got really excited and signed up for Tetlock’s group. I was interested in testing my skills (obviously amazing!) and seeing what I could learn that could be applied to my job. A week or two later Tyler Cowen mentioned that Robin Hanson was setting up a group, too. He had a power point presentation explaining what their approach was going to be and it was extremely interesting. Really cool stuff. However, since I had already signed up for one team I couldn’t switch teams. I think one of the big reasons all the other groups underperformed and dropped out was that Tetlock had a far bigger group (I suspect, although don’t know for sure) since he announced first and had the star power. People who are excited about making accurate predictions probably know who Tetlock is and want to work for his team.
Making predictions was a slog. There were so many frickin questions, and I knew nothing about most of them. At best I could google a few stories, read them, and then make a rough judgement based on my cursory reading. I never felt like I had enough time to really understand the issues and perspectives. He points out that the best people constantly update their predictions, but that just makes the time commitment way worse. Like, of course updating helps. If there’s a 1% daily probability something happens, then as you get close to the date the overall probability should go down if it keeps not happening. But that’s a pain in the ass and takes a ton of time even if you’re not incorporating new information like you should. I was lucky if I updated once. Turns out I did pretty good. Like, above average, which I considered a success given how behind I was and how little time I devoted to each question.
The next year was better because you could read other people’s opinions, which made it easier to synthesize the arguments and vote quicker. But, I was burned out and dropped out partway into the year. I’m sure the reason that people who put 22% are way more accurate than the people who put 40% is because the “40%” guy is just guessing and it’s really low information. The difference in thought between average and devoted people is super high.
I think it shows a couple things. 1. It’s not surprising that most of the superforecasters had a bunch of time on their hands. He dismisses this at one point in the book, but I don’t think it’s fair. Being great takes a lot of time, and there are definitely rewards to simple diligence of 1-2 hours a day. It might not be a job, but it’s a full time hobby. 2. Tetlock had the critical mass to create the superforecaster groups. This allowed them to create a culture of people who devoted a lot of time to it, invested their ego in succeeding, and were spending time talking to people who had the right attitudes that were conducive to accurate forecasting.
A lot of organizations recognize that they succeed not because of their structure, but because of the people. I suspect the same thing is going on with Tetlock, where he had enough people who were excellent to make some strong teams. The aggregation and extremizing was the easy stuff.
This was the GMU approach I mentioned above.
http://www.overcomingbias.com/2011/08/join-gmus-daggre-team.html